Writing a Python Wrapper for html2text using cffi

Background

I fixed the html2text performance issue in the last post, so now I can use it. I need to call it from Python, which leaves me few choices. "Python by the C side", a post on the PayPal Engineering blog, lists the options. A C extension is hard to write and is not worth it here. This post covers my experience with cffi, and my reflections on it, as a first-time user.

The plan

The whole idea is to compile the html2text project as a shared library (.so), then load it in Python using cffi. If it works, package it for distribution.

Execution

Compiling is not hard. In the Makefile, add the -fPIC flag to the compile step and the -shared flag to the link step, and there you have a .so.

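For the cffi part, I am using ABI-level, in-line mode. ABI-level means I access the library at the binary level instead of compiling against it, and in-line means the cdef declarations are parsed each time the module is imported. The .so itself is prepared beforehand.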

To load the library using cffi:

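import os

import cffi

ffi = cffi.FFI()
# Declare the two C functions exported by the shared library.
ffi.cdef('char *cffi_html2text(char *html);'
         'void cffi_free(char *ret);')
# Load the .so that sits next to this Python file.
here = os.path.abspath(os.path.dirname(__file__))
C = ffi.dlopen(os.path.join(here, 'libhtml2text.so'))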

To make this work, I created a function cffi_html2text and a teardown function cffi_free on the C side, and I call them from Python.

The Python code looks like this:

def html2text(html):
    x = C.cffi_html2text(html)

    if x == ffi.NULL:
        raise Exception('NULL')
    s = ffi.string(x)
    C.cffi_free(x)
    return s

It is very simple at the moment, but it does the job.
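One detail worth sketching: on Python 3, cffi only accepts bytes for a char * argument, and ffi.string() returns bytes for a char * result. A minimal sketch of a text-friendly layer follows. The name make_text_wrapper is hypothetical, and the UTF-8 default is an assumption about what the C library reads and writes.

```python
def make_text_wrapper(convert_bytes, encoding="utf-8"):
    """Wrap a bytes-in, bytes-out converter so callers can use str.

    `convert_bytes` stands in for the html2text() function above.
    The UTF-8 default is an assumption, not something the library promises.
    """
    def convert(html):
        # cffi rejects str for a `char *` argument on Python 3, so encode first.
        raw = html.encode(encoding) if isinstance(html, str) else html
        result = convert_bytes(raw)
        # ffi.string() on a `char *` yields bytes; decode for the caller.
        return result.decode(encoding, errors="replace")
    return convert
```

With this in place, something like make_text_wrapper(html2text) would give a function that takes and returns ordinary Python strings.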

Distributing and installing

The next big problem is how to distribute the code. The .so is platform dependent, so it should not ship with the package.

Now the problem is how to write setup.py so that when a user installs the Python package, it compiles the C/C++ code into a .so and copies it next to the Python code. I cannot use Extension from distutils because I have to run ./configure && make.

It took a lot of time to find a solution. The closest SO thread is here, but it doesn’t work well nowadays. A simple override of the setuptools install command only works when the user runs python setup.py install manually. However, pip runs python setup.py bdist_wheel.

After a while, I found the solution in llvmlite here. It contains exactly what I need, which is to override install, build, build_ext and bdist_wheel.

Now that the .so is compiled during pip install, make sure the .so gets copied to where the Python files are by setting the package_data option in setup.
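The build step that those overridden commands need to trigger can be sketched as a small helper. This is only an illustration: the name build_shared_library, its parameters, and the assumption that make leaves the .so at the top of the source directory are all hypothetical, and this is not the code used in llvmlite.

```python
import os
import shutil
import subprocess


def build_shared_library(source_dir, package_dir,
                         lib_name="libhtml2text.so",
                         run=subprocess.check_call):
    """Run ./configure && make in source_dir, then copy the .so to package_dir.

    Hypothetical helper: the layout (lib_name ending up directly inside
    source_dir) is an assumption made for this sketch.
    """
    # Each step raises CalledProcessError if the command fails.
    run(["./configure"], cwd=source_dir)
    run(["make"], cwd=source_dir)

    built = os.path.join(source_dir, lib_name)
    if not os.path.isfile(built):
        raise RuntimeError("expected %s after make" % built)

    # Place the library next to the Python files so package_data can pick it up.
    target = os.path.join(package_dir, lib_name)
    shutil.copy(built, target)
    return target
```

Each overridden command could call this helper in its run() method before delegating to the parent class, so the library exists by the time the package files are collected.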

Drawbacks and traps

  1. C + cffi is fast enough for my use, but the core problem is stability. If the C code segfaults, the whole Python process goes down with it, and there is nothing I can do about that from Python. Indeed, I have seen bug reports saying that html2text segfaults on some inputs. I just cannot take the risk of crashing the whole Python program without warning.
  2. To pass the result back from C to Python, the most reliable way seems to be to malloc it, pass the pointer back to Python, copy it into a Python object using cffi functions, then free it in C through cffi. You cannot return a pointer to a local variable in C, because once the C function returns, that memory is likely to be overwritten. With the malloc and free method, if a Python exception is raised before free is called, the memory leaks. The current implementation (the code above) is naive. I probably have to put the free call in a finally clause.
  3. The setup.py mess above.
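The finally clause mentioned in the second point could look roughly like the sketch below. Passing ffi and lib in as arguments (and the name make_html2text) is a choice made only for this sketch, so that it does not depend on module-level state. They correspond to the ffi and C objects created earlier.

```python
def make_html2text(ffi, lib):
    """Build a wrapper around lib.cffi_html2text that always frees the result.

    `ffi` and `lib` are the objects returned by cffi.FFI() and ffi.dlopen();
    taking them as parameters is an assumption of this sketch.
    """
    def html2text(html):
        ptr = lib.cffi_html2text(html)
        if ptr == ffi.NULL:
            # Nothing was allocated, so there is nothing to free.
            raise RuntimeError("cffi_html2text returned NULL")
        try:
            # Copies the C string into a Python bytes object.
            return ffi.string(ptr)
        finally:
            # Runs even if ffi.string() raises, so the buffer is not leaked.
            lib.cffi_free(ptr)
    return html2text
```

cffi also provides ffi.gc(), which attaches a destructor to a pointer. That might be an alternative, although the explicit finally keeps the moment of the free obvious.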

Result

And here is where I arrive at last, the pyhtml2text project. Feel free to leave a comment. This is my first time wrapping C code for Python. If I have done anything stupid, make sure to let me know.

To conclude, I probably need to write a Python version of html2text due to stability requirements.