Background This is the 2nd part of the Fixing Slow MySQL series. If you haven’t seen the 1st part, here it is: Fixing Slow MySQL #1: Foreign Key & Row Lock Contention.
Despite the successful fix in part 1, the database is still slow and the application goes down as load goes up, but the symptoms are different from the previous time. It means we hit another bottleneck.
Database Schema Here’s a minimal database schema to illustrate the issue.Background There is service downtime due to “Slow MySQL”. But why is it slow? And how to fix it? Let’s start digging.
This post is the first part of the series.
Database Schema Here’s a minimal database schema to illustrate the issue.
CREATE TABLE parent ( id INT NOT NULL AUTO_INCREMENT, child_count INT NOT NULL, PRIMARY KEY (id) ); CREATE TABLE child ( id INT NOT NULL AUTO_INCREMENT, parent_id INT, FOREIGN KEY (parent_id) REFERENCES parent(id) ON DELETE CASCADE, PRIMARY KEY (id) ); Add example data to database.Background I have been experiencing headaches and fatigue since I moved into my new apartment. I suspect that they are caused by the preinstalled LED light bulbs.
Flickering Although high-frequency flickers are imperceptible to the human eye, prolonged exposure to a flickering light source may cause headaches and fatigue.
Physics There are multiple reasons for flickering LEDs, e.g. PWM dimming and poor AC/DC converters (e.g. LED drivers).
PWM dimming is a way to adjust brightness on LED backlit monitors and other dimmable LED bulbs.I’m a fan of Uncle Bob. I read Clean Code, Clean Coder, Clean architecture and now Clean Agile. Despite being opinionated, they provide valuable insight into software development. This book is very relevant because I just started working in an agile team. It gives more context of the practices that we use. I’ll summarize some lessons I’ve learned from the book.
There’s the “Iron Cross of project management”.Introduction This is a post to make a note of my recent journey of diagnosing and fixing a failing HTTPS API request.
Background I have a user having issues with the Android app. It appears that the app has difficulty communicating with the server (API gateway). A generic connection error is shown to the user upon API call.
Observation 0 It cannot be reproduced locally. None of my devices (phones and browsers) has errors communicating with the server.In anticipation of growing traffic, a few websites are migrated to Cloudflare for better user experience. Here are the lessons learned:
1. Auto DNS record import is not perfect The automatic DNS record import feature comes in very handy, but it may be tricky if you have something more than simple DNS records.
Since the first few migrations are very sucessful, I go on to migrate a site with more DNS records.Benefits of a serverless cron job What’s great about going serverless with a cron job is that there is no need to setup the environment, configure crontab, logging, error detection, etc. And you don’t have to worry about if the VM is down. It is also easier on your wallet.
Here I am using AWS Lambda for my serverless cron job.
Configuration and tips AWS Lambda function can be triggered periodically using CloudWatch events.1. OpenVPN & MFA There are few tutorials about setting up OpenVPN and multi-factor authentication. For MFA, Google Authenticator is used here.
To setup OpenVPN server, I use OpenVPN road warrior installer. For MFA config, the script from egonbraun is forked and modified to work with road warrior setup: https://gist.github.com/carsonip/b02eecf9f7d036555a53fea6f516ced8
Also, the default OVPN server will renegotiate every 1 hour, but this will fail the MFA authentication, effectively disconnecting the OVPN connection periodically.Background Profiling shows that pymongo’s bson.json_util.loads is consuming an unusual amount of CPU time.
Benchmark To confirm the function is bad, let’s do some benchmarking.
import pyperf def data(): import json return json.dumps(['asdfasdf%s' % i for i in xrange(20)]) s = data() runner = pyperf.Runner() runner.timeit(name="json", stmt="json.loads(s)", setup="from __main__ import s; import json;") runner.timeit(name="simplejson", stmt="simplejson.loads(s)", setup="from __main__ import s; import simplejson;") runner.timeit(name="bson json_util", stmt="json_util.loads(s)", setup="from __main__ import s; from bson import json_util;") Result:I am dealing with golang and CI (namely CircleCI) recently. There are lots of tricks to speed up the build / test in CI and this is a short post about go fmt.
This CircleCI blog post suggests the use of this:
! go fmt ./... 2>&1 | read However, I find this unreliable as it works locally but exits with return code >128 for read on CircleCI. According to the man page, it indicates some kind of a timeout.This popular book was on my list for a while until I recently have the time to read it. The book is about how the software systems are managed throughout their lifecycle in Google at its massive scale. Here I will jot down some key takeaways.
SRE enables a better balance between innovation and reliability of products. (Chapter 3)
Introducing planned outage may help identify parts that have false assumptions about reliability.I have spent some time researching on InnoDB row formats and InnoDB compression. This article is about things that I have learned.
InnoDB file format and row format There are 2 file formats, Antelope and Barracuda, with the latter being the newer. Antelope supports row formats Redundant and Compact, whereas Barracuda supports 2 more, Dynamic and Compressed.
Dynamic vs Compact Compact is the default row format in MySQL 5.6. The difference between Compact and Dynamic is the handling of BLOBs (including BLOB, TEXT, VARCHAR, etc).In the last binary search article, we discussed how to write a correct binary search. But sometimes we are not asked to find any position of the target. If we need the first or the last occurrence, can we do it?
The classic binary search In Python, the basic binary search looks like this:
lo = 0 hi = N-1 while (lo < hi): mid = (lo+hi)//2 if A[mid] == target: return mid elif A[mid] < target: lo = mid + 1 else: hi = mid return lo if A[lo] == target else -1 And the invariant of the algorithm is that A[lo] <= target <= A[hi].Background My gevent application hangs after issuing a query that yields 1000+ rows and then a COMMIT. After fixing mysqlclient (see Python mysqlclient Doesn’t Work Well with gevent), the COMMIT does not hang the whole application, but it still hangs the greenlet that issues the SQL query.
Setup: Ubuntu 16.04, ProxySQL v1.4.13, Amazon RDS MySQL 5.6 (ProxySQL backend), Python 2.7.12 application running on the same machine as ProxySQL
Symptoms It is always reproducible with the specific query, the backend, and on that machine.mysqlclient (MySQLdb fork) is a Python MySQL driver which uses libmysqlclient (or libmariadbclient). This means speed and stability. Does it work well with gevent? Not really.
In mysqlclient <= v1.3.13, there’s a sample called waiter_gevent.py which looks like this:
from __future__ import print_function """Demo using Gevent with mysqlclient.""" import gevent.hub import MySQLdb def gevent_waiter(fd, hub=gevent.hub.get_hub()): hub.wait(hub.loop.io(fd, 1)) def f(n): conn = MySQLdb.connect(user='root', waiter=gevent_waiter) cur = conn.cursor() cur.execute("SELECT SLEEP(%s)", (n,)) cur.execute("SELECT 1+%s", (n,)) print(cur.Background I come across some Python code using boto, the Python interface to AWS. There’s boto3, which is a newer version of boto. If you are starting a new project, you should be looking at boto3 instead of boto.
I am told that boto pools connections. This is a good thing in terms of performance. Never waste HTTPS connections because it is expensive to setup. But does it limit the number of outgoing connections like SQLAlchemy’s QueuePool, or is it configurable in terms of number of pooled connections?Update 22 Oct 2019: Using G1GC did bring down the CPU usage a bit, but the excessive CPU usage is caused by an Elasticsearch bug described in this GitHub issue.
There are some GC issues causing Elasticsearch in the ELK stack to freeze and hit 100% CPU usage. The ELK is setup using docker-elk.
Elasticsearch 6.5 comes with G1GC support according to this official post. Let’s see if it helps.Update 21 Mar 2019 There’s a new post Fixing ProxySQL Idle Threads Epoll Hang Heisenbug about fixing a bug of ProxySQL when --idle-threads is enabled.
Problem ProxySQL does not scale with high numbers of connections by default. When there are tens of thousands of client connection to ProxySQL, the CPU usage of ProxySQL may go up to 300%+ because by default it uses 4 threads. Even with 3000 mostly idle connections, it hits ~70% CPU usage.Update 8 Jun 2019: Change crontab certbot renew command to use --deploy-hook instead of --renew-hook.
Using certbot to install and auto-renew Let’s Encrypt SSL certs with nginx installed in system is almost fool-proof. How about nginx inside docker? Not so easy.
Assume we use the official nginx docker imageand start the docker container with name my_nginx.
docker run -d -p 80:80 -p 443:443 -v /var/www:/var/www -v /etc/letsencrypt:/etc/letsencrypt --name my_nginx nginx Assuming the domain name is www.To work around some restrictions, I need to load balance outgoing traffic such that they appear to be coming from different IPs.
Here is how you do it with HAProxy:
backend servers server srv1 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.4 server srv2 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.5 server srv3 target.example.com:8080 check source 0.0.0.0 usesrc 10.10.0.6 Use mode tcp and this works for SSL connections too. The IPs like 10.10.0.4 in the example are private IPs, but they will be translated to different public IPs thanks to Azure’s SNAT.Background: I have multiple 200GB+ MySQL dumps to import. It takes a lot of time.
pv shows you the progress so you have the idea how long will the import take. Pipe the dump.sql.gz file to gunzip instead of unzipping it then import. Saves much disk read and write, not to mention the time and effort. e.g. pv dump.sql.gz | gunzip -c | mysql grep is slow for a large file.I have recently migrated to Amazon RDS from MySQL on Azure VM.
This percona article about migrating to RDS is actually very helpful. I have a few things to add though.
Setup My setup is a self-hosted MySQL 5.6 server in VM on Azure and a new Amazon RDS for MySQL 5.6.
AWS DMS I have tried AWS Database Migration Service (DMS). It is reasonably fast. The main problem is it does not look like mature technology at all.parallel-ssh has a -I option to feed in a shell script file such that quotes escaping is not needed. On Ubuntu 14.04, the HAProxy may have 2 pids after sudo service haproxy restart. I suspect one of them is a zombie since the new open file limit does not apply. The safer way is to killall haproxy then sudo service haproxy start. HAProxy has a maxconn globally and maxconn for each frontend.Another exercise from Programming Pearls: Heap. I finished the code a week ago but didn’t have the time for the blog post until now.
Here’s the code:I am reading Programming Pearls and figure it would be fun to write Quicksort in Python.
Here are my code and benchmarks.
When I try to compare my code with others, I am surprised to see many faulty implementations of Quicksort with quadratic runtime on certain input.
The key to write a correct and fast Quicksort:
Watch out for list of equal items. qsort2 uses 2 pointers (i, j) to make sure the problem is divided into subproblems of similar size.Update 16 Mar 2019 Updated instructions.
Update 16 Oct 2018 There was a bug in pytracemalloc that prevents the PYTHONTRACEMALLOC environment variable from working. I have submitted a pull request and it is now merged. The PR fixed the bug in Python patches, added a testing script for the patches and improved the documentation.
Background My application is killed by the OOM killer. How do I find out why the application is taking up so much memory?There is this error when I use PyCharm debugger during pytest.
NoSuchColumnError: "Could not locate column in row for column 'mytable.id'" This is a SQLAlchemy error. No useful help on Google. No error if the breakpoints are muted. Therefore, it is possibly related to lazy loading of relationships when displaying variables.
This can be resolved by turning off the “load values asynchronously” option in PyCharm debugger. As far as I can remember, this is a new feature in PyCharm 2017.Background My Python application cannot connect to MySQL. The error message looks like this:
Can’t connect to MySQL server on ‘127.0.0.1’ (4)
Error 4 means Interrupted system call.
I am using mysqlclient, the C wrapper MySQL connector. The error happens on both MySQL 5.6 and 5.7. It can be reproduced consistently. It seems that PyMySQL doesn’t have this problem. Also I am using gevent but it is not much related in this case.Background Observation I notice a lot of "No connection available." logs from Redis Python client redis-py BlockingConnectionPool. Looking into the source of redis-py:
try: connection = self.pool.get(block=True, timeout=self.timeout) except Empty: # Note that this is not caught by the redis client and will be # raised unless handled by application code. If you want never to raise ConnectionError("No connection available.") It happens when there is no connection in the pool available.Observation I have a HTTP request which will return after a specified time t. The request is made using Python requests from an Azure VM. When time t is smaller than 4 minutes, it works fine. Otherwise, the requests library raises ReadTimeout.
Explanation TCP connections from Azure has a “not-quite-well-documented” limit which will timeout after 4 minutes of idle activity. The related documentation can be found here under Azure Load Balancer, although it apparently affects Azure VMs with public IP (ILPIP / PIP) without load balancing.Background A friend of mine requests a tutorial of installing Linux alongside an existing Windows installation so here you go. It is not hard but can be tricky at times. No one wants to spend a day troubleshooting the dual boot setup.
The Big Picture Assume that you have a disk (HDD or SSD, doesn’t matter) 100% allocated to a partition (NTFS or whatever) with Windows installed.
We need to prepare the installation USB drive, shrink the partition, then boot from the USB drive and install your favorite Linux distro, in this case, Linux Mint or Ubuntu.Background I’ve fixed the html2text performance issue in last post, so now I can use it. I need to use it from Python, and that leaves me not many choices. Python by the C side, a blog post in the PayPal Engineering blog, has listed the options. C extension is hard to code and is not worth it. This post is about the experience and reflections about my first time using cffi.Background I come across this command line utility available in linux called html2text which was first written in 1999 and changed hands later. Obviously an old project, but a solid one. At least it handles a<div><br></div>b properly by outputing a\n\nb, instead of a\n\n\nb like most of the other converters out there. (I’m looking at you, Python html2text.)
I download the source code of v1.3.2a from here and play around it.Problem Under heavy load using gevent, I see this:
Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/gevent/threadpool.py", line 207, in _worker value = func(*args, **kwargs) error: [Errno 11] Resource temporarily unavailable (<ThreadPool at 0x7fe468930dd0 0/5/10>, <built-in function getaddrinfo>) failed with error Solution There’s no good solution out there. Actually, it is easier to solve than expected. You only have to change the gevent’s DNS resolver.
In the doc, they didn’t clearly state the difference between the resolvers.Background My server uses gevent.pywsgi. It works fine. However, every other few days the server will stop responding to requests. It says “Too many open files” in the logs.
Investigation A simple lsof is showing that there are many socket connections opened by pywsgi even when those sessions are completed. This FD (File Descriptor) leak probably causes the process to reach the ulimit -n per-process number of open files limit.Last time when I compile Qt, I really thought that it would be the last time. No, it isn’t.
A user reported that Penguin Subtitle Player cannot be used under a 32-bit OS. Of course! I compiled it in a 64-bit environment. Time to compile Qt (statically) again in Windows!
Install the required tools: Python 2, ActivePerl, Visual C++ Build Tools 2015 Get the Qt source and decompress it Pro tip 1: Don’t use Windows built-in decompression utility because it is horribly slow.Update 21 Aug 2017 The skill is now open-source! Visit it on GitHub: github.com/carsonip/alexa-pocket
Intro I don’t have much time for this blog post because I’ll be taking an early flight tomorrow. Maybe I’ll add something later.
Hackathon To state the obvious, the hackathon is about building Alexa skills. I have submitted 2 entries: My Pocket and Tomato Helper. I’ll focus on My Pocket in this blog post.
My Pocket The Alexa skill is about reading your saved pocket articles to you using Alexa.Update: There’s a new post about building Qt statically in Windows
It has been a while since I last updated Penguin Subtitle Player. Anyways, after a few days of work, and more than a year of waiting since the last beta, here comes the first production release of Penguin Subtitle Player. Apart from developing multi-subtitle-format parser support for maximum flexibility and maintainability and fixing a few GitHub issues, most importantly, I have tidied up the project and code to meet the standard of a good open-source project.Background As I am preparing for interviews recently, this is a good time to write about my experience in writing a binary search. Please correct me if I get anything wrong in the content.
Binary Search Binary search is always one of the most popular algorithms and at the same time tough to implement correctly. It is surprisingly hard to find a correct while easy-to-modify binary search code out there.Background I fall in love with rsync lately. It is particularly useful when I sync my hadoop stuff (scripts and input, which add up to a few GBs) between local and my hadoop cluster. After running the sync script for a few times, I cannot ssh to the machine anymore. This post is about how I debug it and the lessons learned.
This is how the problematic script sync.sh looks like: (the IP is masked for obvious reasons) (warning: this script is faulty, do not use)Hackathon It is the second and the last hackathon in my SF trip. The /hack hackathon organized by HackerEarth gives me an enjoyable weekend.
Participants are supposed to team up before the day. I didn’t manage to find a team beforehand and end up meeting 2 wonderful teammates at the venue. One Singaporean, one Japanese. They want to work on Amazon Echo and luckily I bring one. Match made in heaven.Hackathon The weekend is enjoyable. Martin (My roommate in SF) and I join the Open Source Hack organized by GitHub at GitHub HQ in SF. Pretty fun. The venue is great. The pool table is my favorite. Great food all the time. However, the atmosphere is more like “just for fun” or educational than competitive, which is kind of different from what I expected.
Idea The guy next to us has an Amazon Echo.Hello world! This is the first post of the blog.
Why did I start this blog? I’m not good at writing and I don’t really enjoy doing it either. However, there’s too many things going on every day and I find it helpful to jot them down. It’s also a good exercise to summerize what you’ve done and write them down clearly and concisely.
Why is the blog named “Overthought”?