On MySQL InnoDB Row Formats and Compression

I have spent some time researching on InnoDB row formats and InnoDB compression. This article is about things that I have learned. InnoDB file format and row format There are 2 file formats, Antelope and Barracuda, with the latter being the newer. Antelope supports row formats Redundant and Compact, whereas Barracuda supports 2 more, Dynamic and Compressed. Dynamic vs Compact Compact is the default row format in MySQL 5.6. The difference between Compact and Dynamic is the handling of BLOBs (including BLOB, TEXT, VARCHAR, etc).

More on Binary Search: Variants

In the last binary search article, we discussed how to write a correct binary search. But sometimes we are not asked to find any position of the target. If we need the first or the last occurrence, can we do it? The classic binary search In Python, the basic binary search looks like this: lo = 0 hi = N-1 while (lo < hi): mid = (lo+hi)//2 if A[mid] == target: return mid elif A[mid] < target: lo = mid + 1 else: hi = mid return lo if A[lo] == target else -1 And the invariant of the algorithm is that A[lo] <= target <= A[hi].

Fixing ProxySQL Idle Threads Epoll Hang Heisenbug

Background My gevent application hangs after issuing a query that yields 1000+ rows and then a COMMIT. After fixing mysqlclient (see Python mysqlclient Doesn’t Work Well with gevent), the COMMIT does not hang the whole application, but it still hangs the greenlet that issues the SQL query. Setup: Ubuntu 16.04, ProxySQL v1.4.13, Amazon RDS MySQL 5.6 (ProxySQL backend), Python 2.7.12 application running on the same machine as ProxySQL Symptoms It is always reproducible with the specific query, the backend, and on that machine.

Python mysqlclient Doesn't Work Well with gevent

mysqlclient (MySQLdb fork) is a Python MySQL driver which uses libmysqlclient (or libmariadbclient). This means speed and stability. Does it work well with gevent? Not really. In mysqlclient <= v1.3.13, there’s a sample called waiter_gevent.py which looks like this: from __future__ import print_function """Demo using Gevent with mysqlclient.""" import gevent.hub import MySQLdb def gevent_waiter(fd, hub=gevent.hub.get_hub()): hub.wait(hub.loop.io(fd, 1)) def f(n): conn = MySQLdb.connect(user='root', waiter=gevent_waiter) cur = conn.cursor() cur.execute("SELECT SLEEP(%s)", (n,)) cur.execute("SELECT 1+%s", (n,)) print(cur.

Fixing boto (AWS Python interface) Quadratic Runtime Connection Pool

Background I come across some Python code using boto, the Python interface to AWS. There’s boto3, which is a newer version of boto. If you are starting a new project, you should be looking at boto3 instead of boto. I am told that boto pools connections. This is a good thing in terms of performance. Never waste HTTPS connections because it is expensive to setup. But does it limit the number of outgoing connections like SQLAlchemy’s QueuePool, or is it configurable in terms of number of pooled connections?

Use G1GC in Docker ELK's Elasticsearch

There are some GC issues causing Elasticsearch in the ELK stack to freeze and hit 100% CPU usage. The ELK is setup using docker-elk. Elasticsearch 6.5 comes with G1GC support according to this official post. Let’s see if it helps. In docker-compose.yml: elasticsearch: build: context: elasticsearch/ args: ELK_VERSION: $ELK_VERSION volumes: - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml:ro ports: - "9200:9200" - "9300:9300" environment: ES_JAVA_OPTS: "-Xmx256m -Xms256m" networks: - elk Change the line: ES_JAVA_OPTS: "-Xmx256m -Xms256m" to

Scaling ProxySQL with --idle-threads option and epoll

Update 21 Mar 2019 There’s a new post Fixing ProxySQL Idle Threads Epoll Hang Heisenbug about fixing a bug of ProxySQL when --idle-threads is enabled. Problem ProxySQL does not scale with high numbers of connections by default. When there are tens of thousands of client connection to ProxySQL, the CPU usage of ProxySQL may go up to 300%+ because by default it uses 4 threads. Even with 3000 mostly idle connections, it hits ~70% CPU usage.

Use Let's Encrypt with Certbot and nginx inside Docker

Update 8 Jun 2019: Change crontab certbot renew command to use --deploy-hook instead of --renew-hook. Using certbot to install and auto-renew Let’s Encrypt SSL certs with nginx installed in system is almost fool-proof. How about nginx inside docker? Not so easy. Assume we use the official nginx docker imageand start the docker container with name my_nginx. docker run -d -p 80:80 -p 443:443 -v /var/www:/var/www -v /etc/letsencrypt:/etc/letsencrypt --name my_nginx nginx Assuming the domain name is www.

Load Balancing Outgoing Traffic with HAProxy

To work around some restrictions, I need to load balance outgoing traffic such that they appear to be coming from different IPs. Here is how you do it with HAProxy: backend servers server srv1 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.4 server srv2 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.5 server srv3 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.6 Use mode tcp and this works for SSL connections too. The IPs like 10.

Notes on Importing mysqldump

Background: I have multiple 200GB+ MySQL dumps to import. It takes a lot of time. pv shows you the progress so you have the idea how long will the import take. Pipe the dump.sql.gz file to gunzip instead of unzipping it then import. Saves much disk read and write, not to mention the time and effort. e.g. pv dump.sql.gz | gunzip -c | mysql grep is slow for a large file.

Migrating MySQL to Amazon RDS from Azure

I have recently migrated to Amazon RDS from MySQL on Azure VM. This percona article about migrating to RDS is actually very helpful. I have a few things to add though. Setup My setup is a self-hosted MySQL 5.6 server in VM on Azure and a new Amazon RDS for MySQL 5.6. AWS DMS I have tried AWS Database Migration Service (DMS). It is reasonably fast. The main problem is it does not look like mature technology at all.

Random Notes on Scaling

parallel-ssh has a -I option to feed in a shell script file such that quotes escaping is not needed. On Ubuntu 14.04, the HAProxy may have 2 pids after sudo service haproxy restart. I suspect one of them is a zombie since the new open file limit does not apply. The safer way is to killall haproxy then sudo service haproxy start. HAProxy has a maxconn globally and maxconn for each frontend.

Heap and Heapsort in Python

Another exercise from Programming Pearls: Heap. I finished the code a week ago but didn’t have the time for the blog post until now. Here’s the code:

Quicksort in Python

I am reading Programming Pearls and figure it would be fun to write Quicksort in Python. Here are my code and benchmarks. When I try to compare my code with others, I am surprised to see many faulty implementations of Quicksort with quadratic runtime on certain input. The key to write a correct and fast Quicksort: Watch out for list of equal items. qsort2 uses 2 pointers (i, j) to make sure the problem is divided into subproblems of similar size.

Debugging Memory Usage in Python 2.7 with tracemalloc

Update 16 Mar 2019 Updated instructions. Update 16 Oct 2018 There was a bug in pytracemalloc that prevents the PYTHONTRACEMALLOC environment variable from working. I have submitted a pull request and it is now merged. The PR fixed the bug in Python patches, added a testing script for the patches and improved the documentation. Background My application is killed by the OOM killer. How do I find out why the application is taking up so much memory?

Python, SQLAlchemy, Pytest, PyCharm Debugger & NoSuchColumnError

There is this error when I use PyCharm debugger during pytest. NoSuchColumnError: "Could not locate column in row for column 'mytable.id'" This is a SQLAlchemy error. No useful help on Google. No error if the breakpoints are muted. Therefore, it is possibly related to lazy loading of relationships when displaying variables. This can be resolved by turning off the “load values asynchronously” option in PyCharm debugger. As far as I can remember, this is a new feature in PyCharm 2017.

Python & MySQL Interrupted System Call & pyinstrument

Background My Python application cannot connect to MySQL. The error message looks like this: Can’t connect to MySQL server on ‘127.0.0.1’ (4) Error 4 means Interrupted system call. I am using mysqlclient, the C wrapper MySQL connector. The error happens on both MySQL 5.6 and 5.7. It can be reproduced consistently. It seems that PyMySQL doesn’t have this problem. Also I am using gevent but it is not much related in this case.

10x Faster Python gevent Redis Connection Pool

Background Observation I notice a lot of "No connection available." logs from Redis Python client redis-py BlockingConnectionPool. Looking into the source of redis-py: try: connection = self.pool.get(block=True, timeout=self.timeout) except Empty: # Note that this is not caught by the redis client and will be # raised unless handled by application code. If you want never to raise ConnectionError("No connection available.") It happens when there is no connection in the pool available.

Azure TCP Idle Timeout, TCP keepalive, and Python

Observation I have a HTTP request which will return after a specified time t. The request is made using Python requests from an Azure VM. When time t is smaller than 4 minutes, it works fine. Otherwise, the requests library raises ReadTimeout. Explanation TCP connections from Azure has a “not-quite-well-documented” limit which will timeout after 4 minutes of idle activity. The related documentation can be found here under Azure Load Balancer, although it apparently affects Azure VMs with public IP (ILPIP / PIP) without load balancing.

Tutorial: Installing Linux alongside an existing Windows Installation (Dual Boot)

Background A friend of mine requests a tutorial of installing Linux alongside an existing Windows installation so here you go. It is not hard but can be tricky at times. No one wants to spend a day troubleshooting the dual boot setup. The Big Picture Assume that you have a disk (HDD or SSD, doesn’t matter) 100% allocated to a partition (NTFS or whatever) with Windows installed. We need to prepare the installation USB drive, shrink the partition, then boot from the USB drive and install your favorite Linux distro, in this case, Linux Mint or Ubuntu.

Writing a Python Wrapper for html2text using cffi

Background I’ve fixed the html2text performance issue in last post, so now I can use it. I need to use it from Python, and that leaves me not many choices. Python by the C side, a blog post in the PayPal Engineering blog, has listed the options. C extension is hard to code and is not worth it. This post is about the experience and reflections about my first time using cffi.

Fixing html2text Quadratic Runtime

Background I come across this command line utility available in linux called html2text which was first written in 1999 and changed hands later. Obviously an old project, but a solid one. At least it handles a<div><br></div>b properly by outputing a\n\nb, instead of a\n\n\nb like most of the other converters out there. (I’m looking at you, Python html2text.) I download the source code of v1.3.2a from here and play around it.

gevent built-in function getaddrinfo failed with error

Problem Under heavy load using gevent, I see this: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/gevent/threadpool.py", line 207, in _worker value = func(*args, **kwargs) error: [Errno 11] Resource temporarily unavailable (<ThreadPool at 0x7fe468930dd0 0/5/10>, <built-in function getaddrinfo>) failed with error Solution There’s no good solution out there. Actually, it is easier to solve than expected. You only have to change the gevent’s DNS resolver. In the doc, they didn’t clearly state the difference between the resolvers.

gevent.pywsgi's File Descriptor Leak and HTTP Keep-Alive

Background My server uses gevent.pywsgi. It works fine. However, every other few days the server will stop responding to requests. It says “Too many open files” in the logs. Investigation A simple lsof is showing that there are many socket connections opened by pywsgi even when those sessions are completed. This FD (File Descriptor) leak probably causes the process to reach the ulimit -n per-process number of open files limit.

Revisiting Qt (Again): Compiling Qt in Windows

Last time when I compile Qt, I really thought that it would be the last time. No, it isn’t. A user reported that Penguin Subtitle Player cannot be used under a 32-bit OS. Of course! I compiled it in a 64-bit environment. Time to compile Qt (statically) again in Windows! Steps: Install the required tools: Python 2, ActivePerl, Visual C++ Build Tools 2015 Get the Qt source and decompress it Pro tip 1: Don’t use Windows built-in decompression utility because it is horribly slow.

Amazon Alexa Skills Challenge and My Pocket

Update 21 Aug 2017 The skill is now open-source! Visit it on GitHub: github.com/carsonip/alexa-pocket Intro I don’t have much time for this blog post because I’ll be taking an early flight tomorrow. Maybe I’ll add something later. Hackathon To state the obvious, the hackathon is about building Alexa skills. I have submitted 2 entries: My Pocket and Tomato Helper. I’ll focus on My Pocket in this blog post. My Pocket The Alexa skill is about reading your saved pocket articles to you using Alexa.

Revisiting Penguin Subtitle Player and Qt

Update: There’s a new post about building Qt statically in Windows It has been a while since I last updated Penguin Subtitle Player. Anyways, after a few days of work, and more than a year of waiting since the last beta, here comes the first production release of Penguin Subtitle Player. Apart from developing multi-subtitle-format parser support for maximum flexibility and maintainability and fixing a few GitHub issues, most importantly, I have tidied up the project and code to meet the standard of a good open-source project.

Writing a Binary Search

Background As I am preparing for interviews recently, this is a good time to write about my experience in writing a binary search. Please correct me if I get anything wrong in the content. Binary Search Binary search is always one of the most popular algorithms and at the same time tough to implement correctly. It is surprisingly hard to find a correct while easy-to-modify binary search code out there.

Debugging rsync and ssh

Background I fall in love with rsync lately. It is particularly useful when I sync my hadoop stuff (scripts and input, which add up to a few GBs) between local and my hadoop cluster. After running the sync script for a few times, I cannot ssh to the machine anymore. This post is about how I debug it and the lessons learned. This is how the problematic script sync.sh looks like: (the IP is masked for obvious reasons) (warning: this script is faulty, do not use)

Slash Hack 2016

Hackathon It is the second and the last hackathon in my SF trip. The /hack hackathon organized by HackerEarth gives me an enjoyable weekend. Participants are supposed to team up before the day. I didn’t manage to find a team beforehand and end up meeting 2 wonderful teammates at the venue. One Singaporean, one Japanese. They want to work on Amazon Echo and luckily I bring one. Match made in heaven.

GitHub Open Source Hack 2016

Hackathon The weekend is enjoyable. Martin (My roommate in SF) and I join the Open Source Hack organized by GitHub at GitHub HQ in SF. Pretty fun. The venue is great. The pool table is my favorite. Great food all the time. However, the atmosphere is more like “just for fun” or educational than competitive, which is kind of different from what I expected. Idea The guy next to us has an Amazon Echo.

Hello world!

Hello world! This is the first post of the blog. Why did I start this blog? I’m not good at writing and I don’t really enjoy doing it either. However, there’s too many things going on every day and I find it helpful to jot them down. It’s also a good exercise to summerize what you’ve done and write them down clearly and concisely. Why is the blog named “Overthought”?