Heap and Heapsort in Python

Another exercise from Programming Pearls: Heap. I finished the code a week ago but didn’t have the time for the blog post until now. Here’s the code:

Quicksort in Python

I am reading Programming Pearls and figure it would be fun to write Quicksort in Python. Here are my code and benchmarks. When I try to compare my code with others, I am surprised to see many faulty implementations of Quicksort with quadratic runtime on certain input. The key to write a correct and fast Quicksort: Watch out for list of equal items. qsort2 uses 2 pointers (i, j) to make sure the problem is divided into subproblems of similar size.

Debugging Memory Usage in Python 2.7 with tracemalloc

Update 16 Oct 2018 There was a bug in pytracemalloc that prevents the PYTHONTRACEMALLOC environment variable from working. I have submitted a pull request and it is now merged. The PR fixed the bug in Python patches, added a testing script for the patches and improved the documentation. Background My application is killed by the OOM killer. How do I find out why the application is taking up so much memory?

Python, SQLAlchemy, Pytest, PyCharm Debugger & NoSuchColumnError

There is this error when I use PyCharm debugger during pytest. NoSuchColumnError: "Could not locate column in row for column 'mytable.id'" This is a SQLAlchemy error. No useful help on Google. No error if the breakpoints are muted. Therefore, it is possibly related to lazy loading of relationships when displaying variables. This can be resolved by turning off the “load values asynchronously” option in PyCharm debugger. As far as I can remember, this is a new feature in PyCharm 2017.

Python & MySQL Interrupted System Call & pyinstrument

Background My Python application cannot connect to MySQL. The error message looks like this: Can’t connect to MySQL server on ‘127.0.0.1’ (4) Error 4 means Interrupted system call. I am using mysqlclient, the C wrapper MySQL connector. The error happens on both MySQL 5.6 and 5.7. It can be reproduced consistently. It seems that PyMySQL doesn’t have this problem. Also I am using gevent but it is not much related in this case.

10x Faster Python gevent Redis Connection Pool

Background Observation I notice a lot of "No connection available." logs from Redis Python client redis-py BlockingConnectionPool. Looking into the source of redis-py: try: connection = self.pool.get(block=True, timeout=self.timeout) except Empty: # Note that this is not caught by the redis client and will be # raised unless handled by application code. If you want never to raise ConnectionError("No connection available.") It happens when there is no connection in the pool available.

Azure TCP Idle Timeout, TCP keepalive, and Python

Observation I have a HTTP request which will return after a specified time t. The request is made using Python requests from an Azure VM. When time t is smaller than 4 minutes, it works fine. Otherwise, the requests library raises ReadTimeout. Explanation TCP connections from Azure has a “not-quite-well-documented” limit which will timeout after 4 minutes of idle activity. The related documentation can be found here under Azure Load Balancer, although it apparently affects Azure VMs with public IP (ILPIP / PIP) without load balancing.

Tutorial: Installing Linux alongside an existing Windows Installation (Dual Boot)

Background A friend of mine requests a tutorial of installing Linux alongside an existing Windows installation so here you go. It is not hard but can be tricky at times. No one wants to spend a day troubleshooting the dual boot setup. The Big Picture Assume that you have a disk (HDD or SSD, doesn’t matter) 100% allocated to a partition (NTFS or whatever) with Windows installed. We need to prepare the installation USB drive, shrink the partition, then boot from the USB drive and install your favorite Linux distro, in this case, Linux Mint or Ubuntu.

Writing a Python Wrapper for html2text using cffi

Background I’ve fixed the html2text performance issue in last post, so now I can use it. I need to use it from Python, and that leaves me not many choices. Python by the C side, a blog post in the PayPal Engineering blog, has listed the options. C extension is hard to code and is not worth it. This post is about the experience and reflections about my first time using cffi.

Fixing html2text Quadratic Runtime

Background I come across this command line utility available in linux called html2text which was first written in 1999 and changed hands later. Obviously an old project, but a solid one. At least it handles a<div><br></div>b properly by outputing a\n\nb, instead of a\n\n\nb like most of the other converters out there. (I’m looking at you, Python html2text.) I download the source code of v1.3.2a from here and play around it.

gevent built-in function getaddrinfo failed with error

Problem Under heavy load using gevent, I see this: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/gevent/threadpool.py", line 207, in _worker value = func(*args, **kwargs) error: [Errno 11] Resource temporarily unavailable (<ThreadPool at 0x7fe468930dd0 0/5/10>, <built-in function getaddrinfo>) failed with error Solution There’s no good solution out there. Actually, it is easier to solve than expected. You only have to change the gevent’s DNS resolver. In the doc, they didn’t clearly state the difference between the resolvers.

gevent.pywsgi's File Descriptor Leak and HTTP Keep-Alive

Background My server uses gevent.pywsgi. It works fine. However, every other few days the server will stop responding to requests. It says “Too many open files” in the logs. Investigation A simple lsof is showing that there are many socket connections opened by pywsgi even when those sessions are completed. This FD (File Descriptor) leak probably causes the process to reach the ulimit -n per-process number of open files limit.

Revisiting Qt (Again): Compiling Qt in Windows

Last time when I compile Qt, I really thought that it would be the last time. No, it isn’t. A user reported that Penguin Subtitle Player cannot be used under a 32-bit OS. Of course! I compiled it in a 64-bit environment. Time to compile Qt (statically) again in Windows! Steps: Install the required tools: Python 2, ActivePerl, Visual C++ Build Tools 2015 Get the Qt source and decompress it Pro tip 1: Don’t use Windows built-in decompression utility because it is horribly slow.

Amazon Alexa Skills Challenge and My Pocket

Update 21 Aug 2017 The skill is now open-source! Visit it on GitHub: github.com/carsonip/alexa-pocket Intro I don’t have much time for this blog post because I’ll be taking an early flight tomorrow. Maybe I’ll add something later. Hackathon To state the obvious, the hackathon is about building Alexa skills. I have submitted 2 entries: My Pocket and Tomato Helper. I’ll focus on My Pocket in this blog post. My Pocket The Alexa skill is about reading your saved pocket articles to you using Alexa.

Revisiting Penguin Subtitle Player and Qt

Update: There’s a new post about building Qt statically in Windows It has been a while since I last updated Penguin Subtitle Player. Anyways, after a few days of work, and more than a year of waiting since the last beta, here comes the first production release of Penguin Subtitle Player. Apart from developing multi-subtitle-format parser support for maximum flexibility and maintainability and fixing a few GitHub issues, most importantly, I have tidied up the project and code to meet the standard of a good open-source project.

Writing a Binary Search

Background As I am preparing for interviews recently, this is a good time to write about my experience in writing a binary search. Please correct me if I get anything wrong in the content. Binary Search Binary search is always one of the most popular algorithms and at the same time tough to implement correctly. It is surprisingly hard to find a correct while easy-to-modify binary search code out there.

Debugging rsync and ssh

Background I fall in love with rsync lately. It is particularly useful when I sync my hadoop stuff (scripts and input, which add up to a few GBs) between local and my hadoop cluster. After running the sync script for a few times, I cannot ssh to the machine anymore. This post is about how I debug it and the lessons learned. This is how the problematic script sync.sh looks like: (the IP is masked for obvious reasons) (warning: this script is faulty, do not use)

Slash Hack 2016

Hackathon It is the second and the last hackathon in my SF trip. The /hack hackathon organized by HackerEarth gives me an enjoyable weekend. Participants are supposed to team up before the day. I didn’t manage to find a team beforehand and end up meeting 2 wonderful teammates at the venue. One Singaporean, one Japanese. They want to work on Amazon Echo and luckily I bring one. Match made in heaven.

GitHub Open Source Hack 2016

Hackathon The weekend is enjoyable. Martin (My roommate in SF) and I join the Open Source Hack organized by GitHub at GitHub HQ in SF. Pretty fun. The venue is great. The pool table is my favorite. Great food all the time. However, the atmosphere is more like “just for fun” or educational than competitive, which is kind of different from what I expected. Idea The guy next to us has an Amazon Echo.

Hello world!

Hello world! This is the first post of the blog. Why did I start this blog? I’m not good at writing and I don’t really enjoy doing it either. However, there’s too many things going on every day and I find it helpful to jot them down. It’s also a good exercise to summerize what you’ve done and write them down clearly and concisely. Why is the blog named “Overthought”?