Introduction This is a post to make a note of my recent journey of diagnosing and fixing a failing HTTPS API request.
Background I have a user having issues with the Android app. It appears that the app has difficulty communicating with the server (API gateway). A generic connection error is shown to the user upon API call.
Observation 0 It cannot be reproduced locally. None of my devices (phones and browsers) has errors communicating with the server.Observation I have a HTTP request which will return after a specified time t. The request is made using Python requests from an Azure VM. When time t is smaller than 4 minutes, it works fine. Otherwise, the requests library raises ReadTimeout.
Explanation TCP connections from Azure has a “not-quite-well-documented” limit which will timeout after 4 minutes of idle activity. The related documentation can be found here under Azure Load Balancer, although it apparently affects Azure VMs with public IP (ILPIP / PIP) without load balancing.Problem Under heavy load using gevent, I see this:
Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/gevent/threadpool.py", line 207, in _worker value = func(*args, **kwargs) error: [Errno 11] Resource temporarily unavailable (<ThreadPool at 0x7fe468930dd0 0/5/10>, <built-in function getaddrinfo>) failed with error Solution There’s no good solution out there. Actually, it is easier to solve than expected. You only have to change the gevent’s DNS resolver.
In the doc, they didn’t clearly state the difference between the resolvers.Background My server uses gevent.pywsgi. It works fine. However, every other few days the server will stop responding to requests. It says “Too many open files” in the logs.
Investigation A simple lsof is showing that there are many socket connections opened by pywsgi even when those sessions are completed. This FD (File Descriptor) leak probably causes the process to reach the ulimit -n per-process number of open files limit.