Fixing ProxySQL Idle Threads Epoll Hang Heisenbug
19 Mar 2019

Background
My gevent application hangs after issuing a query that yields 1000+ rows and then a COMMIT. After fixing mysqlclient (see Python mysqlclient Doesn’t Work Well with gevent), the COMMIT does not hang the whole application, but it still hangs the greenlet that issues the SQL query.
Setup: Ubuntu 16.04, ProxySQL v1.4.13, Amazon RDS MySQL 5.6 (ProxySQL backend), Python 2.7.12 application running on the same machine as ProxySQL
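To make the scenario concrete, here is a minimal sketch of the client-side pattern that triggers the hang. The table name, socket path, and credentials are placeholders, not the real application; the actual application is gevent-based, but the plain pattern below (a large SELECT followed by a COMMIT, through ProxySQL over a UNIX domain socket) turns out to be all that is needed, as the C++ reproduction later confirms.

```python
# Minimal repro sketch (not the real application). Table name, socket path,
# and credentials are placeholders.
import MySQLdb  # mysqlclient

conn = MySQLdb.connect(
    unix_socket="/tmp/proxysql.sock",  # app -> ProxySQL over a UNIX domain socket
    user="app", passwd="secret", db="appdb",
)
cur = conn.cursor()

# A query that returns 1000+ rows -- large enough for ProxySQL to throttle
# the result set on its way back to the client.
cur.execute("SELECT * FROM big_table LIMIT 2000")
cur.fetchall()

# The hang: the caller waits here for the COMMIT's OK packet,
# which ProxySQL never sends back.
conn.commit()
conn.close()
```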
Symptoms
The hang is always reproducible with the specific query, against that backend, and on that machine. It always happens while waiting for data on the FD that issued the COMMIT. The time taken to reproduce the issue varies from a few minutes to a few hours.
Finding the cause
At first, it does not appear to be a ProxySQL problem.
Monitoring and tracing
socat
My Python application connects to ProxySQL over a UNIX domain socket; it bypasses the TCP/IP stack, which makes it faster, but it also means tcpdump / wireshark cannot be used. This StackOverflow post says that socat works like tcpdump but for UNIX domain sockets. The bad news is that when socat is used, the issue is gone.
strace
strace traces system calls. Seeing the last few system calls (and their return values) before a hang would be very useful in understanding the problem. But the application does not hang when strace is used.
Application
“Heisenbug” is the term for a bug like this. According to Wikipedia:
In computer programming jargon, a heisenbug is a software bug that seems to disappear or alter its behavior when one attempts to study it.
Since no external tools can diagnose the issue, I have to look at the application itself to see if my code or the libraries are responsible for this.
MySQL driver
The hang issue happens with mysqlclient but not with PyMySQL. A lot of time is spent reviewing the mysqlclient (MySQLdb) source. Since it uses libmysqlclient and the Python wrapper is very thin, it does not seem to be the cause. The weird thing is that when I put a few print statements around the area in question, the issue disappears.
A C++ MySQL client program using libmysqlclient is written to issue the exact same query and COMMIT. It does reproduce the issue. This confirms that my application and mysqlclient are not the cause. Why PyMySQL does not reproduce it is unclear (at the time).
ProxySQL and MySQL
If the bug does not exist in the application, the next layer to look at is ProxySQL and MySQL. The chance of the bug existing in ProxySQL is higher than in MySQL, given how popular MySQL is.
ProxySQL
ProxySQL has an --idle-threads option that uses epoll instead of poll for idle connections. This significantly reduces CPU usage when the number of connections is high. See the Scaling ProxySQL with --idle-threads option and epoll blog post for more.
Elimination can quickly determine which part is responsible for the bug (a sketch of the connection setups follows the list):
- ProxySQL over UNIX domain socket with --idle-threads (the original setup): hang
- Bypass ProxySQL, TCP/IP directly to MySQL: no hang
- ProxySQL over TCP/IP with --idle-threads: no hang
- ProxySQL over UNIX domain socket without --idle-threads: no hang
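The four cases differ only in how the client connects and in whether ProxySQL runs with --idle-threads (the flag is toggled on the ProxySQL side, not in the client). A sketch of the client side; socket path, host names, and ports are placeholders, with 6033 being ProxySQL's default MySQL interface port:

```python
# Connection variants used for the elimination test (placeholders throughout).
import MySQLdb

COMMON = dict(user="app", passwd="secret", db="appdb")

def connect(case):
    if case == "proxysql_unix":   # cases 1 and 4: UNIX domain socket to ProxySQL
        return MySQLdb.connect(unix_socket="/tmp/proxysql.sock", **COMMON)
    if case == "mysql_tcp":       # case 2: bypass ProxySQL, TCP/IP straight to MySQL
        return MySQLdb.connect(host="mysql-rds.example.com", port=3306, **COMMON)
    if case == "proxysql_tcp":    # case 3: TCP/IP to ProxySQL
        return MySQLdb.connect(host="127.0.0.1", port=6033, **COMMON)
    raise ValueError(case)
```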
It is very clear that the bug is in ProxySQL and that it is related to the --idle-threads option.
Debug
It is possible to turn on ProxySQL query logging to see which queries go through, but knowing whether or not the COMMIT goes through does not help fix the bug. The other issue with the default query logging is that the server is so busy (a few thousand QPS) that the log would contain too much noise.
I review ProxySQL’s source code and find a few low-hanging fruit, but none of those small bugs causes the hang. ProxySQL’s source code is then modified to log its inner workings in order to locate the problematic part.
Since the server is busy, I add special query handling in ProxySQL to signal whether a session should be tracked. This eliminates noise in the log.
The debugging version of ProxySQL: https://github.com/carsonip/proxysql/tree/ca8fef7f8dfe6d5e6f7e47fb12c6b9ae2743fe1f
Examining the logs
The part of the log involving the hang is extracted and uploaded to a GitHub Gist. The earlier part of the log exists to demonstrate how things usually work; the later part shows how things go south. Here’s the Gist:
https://gist.github.com/carsonip/da65093997b00c69538422f028bcef47
Line 965 of the log shows that the session’s pause_until is modified as a result of the large result set and throttling (related to throttle_max_bytes_per_second_to_client). Since the session is throttled, it stops sending out data until pause_until. However, when the poll thread scans for idle sessions, that specific session is identified as idle and is therefore moved to the epoll thread. The epoll thread only listens for EPOLLIN, and that is how the session stays in the epoll thread forever.
(In fact, if you examine the logs carefully, that session is moved out of the epoll thread and then moved back into it, for reasons unknown.)
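The mechanics can be illustrated with a few lines of Python using select.epoll (a sketch of the failure mode, not ProxySQL's actual code): a connection whose only pending work is outbound, because it is paused by throttling, gets parked in an epoll set that watches for EPOLLIN only, and since the client has nothing more to send, the wait never produces an event for it.

```python
# Sketch of the failure mode (not ProxySQL code): the fd is registered for
# EPOLLIN only, but the pending work is an outbound, throttled write.
import select
import socket

proxy_side, client_side = socket.socketpair()  # stand-in for the client connection

ep = select.epoll()
ep.register(proxy_side.fileno(), select.EPOLLIN)  # idle thread watches for input only

# The client is silently waiting for the COMMIT result, so it sends nothing.
# No EPOLLIN event ever fires, and nothing else (e.g. a timer tied to
# pause_until) wakes the session, so it stays parked indefinitely.
print(ep.poll(timeout=5))  # prints [] -- no event after 5 seconds, and never will
```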
Fix
Once the cause is clear, the fix is obvious.
- Don’t move throttled sessions to the epoll thread (a.k.a. the idle maintenance thread) (#1952)
Enhancement:
- Allow users to disable throttling. (#1953)
Conclusion
This is a heisenbug because any monitoring or tracing slows down the I/O and reduces the chance of throttling kicking in. The UNIX domain socket being so fast is also one of the reasons the bug shows up at all.
Takeaway
- Don’t stop after creating an issue. DIY. Project maintainers may not be able to reproduce and debug the bug because it can be very environment-dependent.
- Locate the area in question by elimination. This saves you the time of checking many other components.
- Gather detailed logs and make sure they are not noisy. If the logs are good enough, they quickly pinpoint the problem.