I fixed the html2text performance issue in the last post, so now I can use it. I need to call it from Python, which does not leave me many choices. Python by the C side, a post on the PayPal Engineering blog, lists the options. A C extension is hard to write and is not worth it here. This post covers my experience with, and reflections on, using cffi for the first time.
The whole idea is to compile the html2text project as a shared library (.so), then load it in Python using cffi. If it works, package it for distribution.
Compiling is not hard. Simply edit the Makefile to add a -fPIC flag when compiling and a -shared flag when linking, and there you have a shared library.
For the cffi part, I am using ABI level, in-line mode: I access the library at the binary level with dlopen, and cffi parses the declarations at import time. The .so itself is prepared beforehand.
To load the library using cffi:
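    import os
    import cffi

    ffi = cffi.FFI()
    # Declare the two C entry points exported by the shared library
    ffi.cdef('char *cffi_html2text(char *html);'
             'void cffi_free(char *ret);')
    # Load the .so that sits next to this Python file
    here = os.path.abspath(os.path.dirname(__file__))
    C = ffi.dlopen(os.path.join(here, 'libhtml2text.so'))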
To make things work, I create a function cffi_html2text and a teardown function cffi_free in C, and I will call them from Python.
The Python code looks like this:
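    def html2text(html):
        # html must be bytes under Python 3; cffi does not accept str for char *
        x = C.cffi_html2text(html)
        if x == ffi.NULL:
            raise Exception('NULL')
        # Copy the NUL-terminated C string into a Python bytes object
        s = ffi.string(x)
        # Release the buffer allocated on the C side
        C.cffi_free(x)
        return s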
It is very simple at the moment, but it does the job.
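Under Python 3, cffi expects bytes for a char * argument and ffi.string returns bytes, so callers working with str need an encoding step around the wrapper. Here is a minimal sketch of that step. The name html2text_str, the convert parameter (which would be the html2text function above) and the choice of UTF-8 are assumptions for illustration, not part of the original code.

```python
def html2text_str(html, convert, encoding="utf-8"):
    """Convert a str of HTML to a str of text.

    `convert` is any bytes-in, bytes-out callable, for example the
    html2text wrapper shown above. The encoding is an assumption.
    """
    raw = convert(html.encode(encoding))
    # Replace undecodable bytes instead of raising on unexpected output
    return raw.decode(encoding, errors="replace")
```

With the wrapper above it would be called as html2text_str(page, html2text).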
Distributing and installing
The next big problem is how to distribute the code. The .so is platform dependent, so it should not ship with the package.
Now the problem is how to write setup.py so that when a user installs the Python package, it compiles the C/C++ code into a .so and copies it next to the Python code. I cannot use distutils since I have to run ./configure && make.
It took a lot of time to find a solution to this. The closest Stack Overflow thread is here, but it doesn’t work well nowadays. A simple override of the install command only works when the user runs python setup.py install manually. However, python setup.py bdist_wheel.
After a while, I found the solution in llvmlite here. It contains exactly what I need, which is to override
Now that the .so is compiled during pip install, make sure the .so gets copied to where the Python files are by setting the package_data option in setup.
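As an illustration of the two steps above (compile during the build, then ship the .so through package_data), here is a minimal sketch. Overriding build_py is my assumption and not necessarily what llvmlite or pyhtml2text does. The directory names html2text (C sources) and pyhtml2text (Python package), and the helper names build_shared_library, make_cmdclass and main, are hypothetical.

```python
import os
import shutil
import subprocess


def build_shared_library(src_dir, pkg_dir, run=subprocess.check_call):
    """Run ./configure && make in src_dir, then copy the .so into pkg_dir."""
    run(["./configure"], cwd=src_dir)
    run(["make"], cwd=src_dir)
    built = os.path.join(src_dir, "libhtml2text.so")
    # Place the library next to the Python files so package_data can pick it up
    shutil.copy(built, pkg_dir)
    return os.path.join(pkg_dir, "libhtml2text.so")


def make_cmdclass(src_dir, pkg_dir):
    """Return a cmdclass dict whose build_py step compiles the library first."""
    # Imported here so the helper above can be used without setuptools
    from setuptools.command.build_py import build_py

    class BuildPyWithMake(build_py):
        def run(self):
            build_shared_library(src_dir, pkg_dir)
            build_py.run(self)

    return {"build_py": BuildPyWithMake}


def main():
    from setuptools import setup

    setup(
        name="pyhtml2text",
        packages=["pyhtml2text"],
        # Ship the freshly built library alongside the Python modules
        package_data={"pyhtml2text": ["libhtml2text.so"]},
        cmdclass=make_cmdclass("html2text", "pyhtml2text"),
    )
```

A real setup.py would call main() at module level. Whether the override is triggered on every install path depends on the pip and setuptools versions in use, so treat this as a starting point to test rather than a guaranteed recipe.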
Drawbacks and traps
- C + cffi is fast enough for my use, but the core problem is stability. If the C code segfaults, the whole Python process goes down with it, and there is nothing you can do about that from inside the process. Indeed, I have seen bug reports of html2text segfaulting on some input. I just cannot take the risk of crashing the whole Python program without warning.
- To pass the result back from C to Python, it seems to me that the most reliable way is to malloc it, pass the pointer back to Python, copy it into a Python object using cffi functions, then free it in C through cffi. You cannot return a pointer to a local variable in C because once the C function returns, that memory is likely to be overwritten. With the free approach, if a Python exception is raised and free is not called, there will be a memory leak. The current implementation (the code above) is naive. I probably have to put the free function call in a
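One hedged way to guarantee that cffi_free runs even if an exception is raised after the buffer has been allocated is a try/finally block. The sketch below assumes the C and ffi objects from the loading code above. They are passed in as the hypothetical parameters lib and ffi only so that the function stands on its own. It is an illustration, not the code used in pyhtml2text.

```python
def html2text(html, lib, ffi):
    """Call cffi_html2text and always release the returned buffer."""
    ptr = lib.cffi_html2text(html)
    if ptr == ffi.NULL:
        raise RuntimeError('cffi_html2text returned NULL')
    try:
        # ffi.string copies the NUL-terminated buffer into a Python object
        return ffi.string(ptr)
    finally:
        # Runs whether or not ffi.string raised, so the buffer is not leaked
        lib.cffi_free(ptr)
```

cffi also provides ffi.gc(ptr, destructor), which ties the destructor call to garbage collection of the returned object. That could be an alternative if the pointer has to outlive a single function call.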
And here is where I ended up: the pyhtml2text project. Feel free to leave a comment. This is my first time wrapping C code for Python. If I have done anything stupid, please let me know.
To conclude, I probably need to write a Python version of html2text due to stability requirements.