Fixing html2text Quadratic Runtime

Background

I come across a command-line utility on Linux called html2text, which was first written in 1999 and changed hands later. Obviously an old project, but a solid one. At least it handles a<div><br></div>b properly by outputting a\n\nb, instead of a\n\n\nb like most of the other converters out there. (I’m looking at you, Python html2text.)

I download the source code of v1.3.2a from here and play around with it. It is fast, but UTF-8 support is lacking, which is a deal breaker for me.

Then I realize the Debian version has a utf8 flag, so I install it with sudo apt install html2text. Neat. Simple UTF-8 HTML input works now. But when I feed it a 0.05MB HTML file, it takes 15 seconds to process, which is ridiculously slow. Something’s wrong.

Investigation

The straightforward way to kickstart the investigation is to profile. With the source code downloaded and the Debian patches applied (using Quilt, quite a hassle), I compile the project with debug flags to get it ready for profiling with Valgrind/Callgrind. The command looks like this: valgrind --tool=callgrind ./html2text 005m.html

In the first run, the functions all show up as addresses instead of function names, which is weird, since I compiled the project with debug flags already. Then I realize I profiled the system’s html2text instead of the freshly compiled ./html2text. Rookie mistake.

So here’s the screenshot of the successful profiling results shown in KCacheGrind:

[Screenshot: KCacheGrind profile before the fix]

Digging deeper

There’s a low-hanging fruit. Two functions, utf_length and utf8_aux_count, occupy most of the runtime. Notice that the call count of utf8_aux_count is crazy high, but its Ir (instructions executed) per call is as low as 19. Let’s look at the source code of utf8_aux_count.

/*           utf_length() and utf_width()       
 *
 *     Very simplified algorithm of calculating length of UTF-8
 *   string. No check for errors. Counting only ASCII bytes and
 *   leading bytes of UTF-8 multibyte sequences. All bytes like
 *   10xxxxxx are dropped. If USE_UTF8 is false then returns
 *   usual length.               --YS
 */

size_t utf8_aux_count(char ch)
{
	if((ch & 0xe0) == 0xc0)
	{
		return 1;
	}
	else if((ch & 0xf0) == 0xe0)
	{
		return 2;
	}
	else if ((ch & 0xf8) == 0xf0)
	{
		return 3;
	}
	else
	{
		return 0;
	}
}

Nothing special about utf8_aux_count: given the first byte of a character, it returns the number of continuation bytes (10xxxxxx) that follow it, and 0 for ASCII bytes and stray continuation bytes.
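As a quick sanity check, here is what it returns for a few representative bytes. This is a standalone toy of mine, not part of the html2text source; the function body is copied from above, just compacted.

#include <cassert>
#include <cstddef>

// Copied from the html2text source above, compacted.
size_t utf8_aux_count(char ch)
{
	if ((ch & 0xe0) == 0xc0)      return 1;
	else if ((ch & 0xf0) == 0xe0) return 2;
	else if ((ch & 0xf8) == 0xf0) return 3;
	else                          return 0;
}

int main()
{
	assert(utf8_aux_count('a') == 0);         // ASCII: no continuation bytes follow
	assert(utf8_aux_count((char)0xC3) == 1);  // 110xxxxx: leading byte of a 2-byte sequence, e.g. 'é'
	assert(utf8_aux_count((char)0xE2) == 2);  // 1110xxxx: leading byte of a 3-byte sequence, e.g. '€'
	assert(utf8_aux_count((char)0xF0) == 3);  // 11110xxx: leading byte of a 4-byte sequence
	assert(utf8_aux_count((char)0xBF) == 0);  // 10xxxxxx continuation bytes count as 0
	return 0;
}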

Let’s look at utf_length.

unsigned int
Line::utf_length(size_type f, size_type t) const
{
	size_type m = (t < length_ ? t : length_);
	size_type r = m - f;
	if(USE_UTF8)
	{
		for (int i = f; i < m; i++)
		{
			char& ch = cells_[i].character;
			size_type aux_count = utf8_aux_count(ch);
			r -= aux_count;
			i += aux_count;
		}
	}
	return r;
}

utf_length looks fine too: it starts from the raw cell count m - f and subtracts the continuation bytes of every multibyte character, so each character contributes exactly 1 to the length, and a single call is linear in the size of the range.
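To make that concrete, here is a standalone stand-in of mine (not the real Line class) that applies the same counting rule to a flat char buffer: a 6-byte line holding "héllo" measures to 5, because the continuation byte of 'é' is not counted.

#include <cassert>
#include <cstddef>
#include <cstring>

// A stand-in for Line::utf_length() over a plain char buffer [f, t):
// start with the cell count and drop every 10xxxxxx continuation byte.
static unsigned int utf_length(const char *cells, size_t f, size_t t)
{
	unsigned int r = (unsigned int)(t - f);
	for (size_t i = f; i < t; ++i)
		if ((cells[i] & 0xC0) == 0x80)  // continuation byte
			--r;
	return r;
}

int main()
{
	const char *line = "h\xC3\xA9llo";    // "héllo": 6 bytes, 5 visible characters
	assert(std::strlen(line) == 6);
	assert(utf_length(line, 0, 6) == 5);  // 0xA9, the continuation byte of 'é', is not counted
	return 0;
}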

How about the caller of utf_length?

/*
 * Make up "line" into an Area. Attempt to return an Area no wider than "w".
 */
static Area *
make_up(const Line &line, Area::size_type w, int halign)
{

  if (line.empty()) return 0;

  auto_ptr<Area> res(new Area);

  Line::size_type from = 0;

  while (from < line.length()) {

    /*
     * A sole newline character has a special meaning: Append a blank line.
     */
    if (line[from].character == '\n') {
      res->resize(res->width(), res->height() + 1);
      from++;
      continue;
    }

    Line::size_type to = from + 1;
    int to_from;

    Line::size_type lbp = (Line::size_type) -1; // "Last break position".

    /*
     * Determine the line break position.
     */
    while (to < line.length()) {
      if (line[to].character == '\n') {
        break;
      }
      char c1 = line[to].character, c2 = line[to - 1].character;
      if (c1 == ' ' || c1 == '('/*)*/ || c1 == '['/*]*/ || c1 == '{'/*}*/ || (
        (
          c2 == '-' ||
          c2 == '/' ||
          c2 == ':'
        ) &&
        c1 != ',' &&
        c1 != '.' &&
        c1 != ';' &&
        c1 != ':'
      )) {
        lbp = to++;
        while (to < line.length() && line[to].character == ' ') to++;
      } else {
        to++;
      }

      if (line.utf_length(from,to) > w && lbp != (Area::size_type) -1) 
                    { to = lbp; break; }
    }

    to_from = line.utf_length(from,to);
    /*
     * Copy the "from...to" range from the "line" to the bottom of the "res"
     * Area.
     */
    Area::size_type x = 0;
    Area::size_type len = to - from;
    if (halign == Area::LEFT || to_from >= w) { ;                   } else
    if (halign == Area::CENTER)           { x += (w - to_from) / 2; } else
    if (halign == Area::RIGHT)            { x += w - to_from;       }
    res->insert(line.cells() + from, len, x, res->height());

    /*
     * Determine the beginnning of the next line.
     */
    if (to == line.length()) break;

    from = to;
    if (line[from].character == '\n') {
      ++from;
    } else
    if (line[from].character == ' ') {
      do {
        ++from;
      } while (from < line.length() && line[from].character == ' ');
    }
  }

  return res.release();
}

This line catches my eye.

      if (line.utf_length(from,to) > w && lbp != (Area::size_type) -1) 
                    { to = lbp; break; }

It is called inside a while loop where from stays fixed and to keeps increasing. Each call to line.utf_length(from,to) redoes work the previous call already did, since utf_length(a,b) == utf_length(a,c) + utf_length(c,b); there is no need to re-measure from from every time. Because every iteration re-scans the whole from..to range, a paragraph of n characters costs on the order of n² work, which is where the quadratic runtime comes from.
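To see how much this matters, here is a minimal standalone sketch, entirely my own toy rather than html2text code (the names measure and cells_scanned are made up). It models the inner loop over a worst-case line with no break positions and counts how many cells each strategy touches: re-measuring from from every time, versus measuring only the new slice and keeping a running sum, as the fix below does.

#include <cstddef>
#include <cstdio>
#include <vector>

// Tally of how many cells the measuring function has looked at.
static long cells_scanned = 0;

// Stand-in for Line::utf_length(): count non-continuation bytes in cells[f, t).
static int measure(const std::vector<char> &cells, size_t f, size_t t)
{
	int r = 0;
	for (size_t i = f; i < t; ++i) {
		++cells_scanned;
		if ((cells[i] & 0xC0) != 0x80)  // skip 10xxxxxx continuation bytes
			++r;
	}
	return r;
}

int main()
{
	const size_t n = 10000;
	std::vector<char> cells(n, 'a');  // a worst-case line with no break positions

	// Naive loop: re-measure the whole range from the start for every value of to.
	cells_scanned = 0;
	for (size_t to = 1; to <= n; ++to)
		measure(cells, 0, to);
	std::printf("naive : %ld cells scanned\n", cells_scanned);  // ~n*(n+1)/2 = 50,005,000

	// Cached loop: measure only the new slice and keep a running total.
	cells_scanned = 0;
	int total = 0;
	size_t last_to = 0;
	for (size_t to = 1; to <= n; ++to) {
		total += measure(cells, last_to, to);
		last_to = to;
	}
	std::printf("cached: %ld cells scanned, length %d\n", cells_scanned, total);  // ~n = 10,000
	return 0;
}

With 10,000 cells the naive strategy touches about 50 million cells while the cached one touches 10,000, which is the difference between quadratic and linear work.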

Solution

The solution is to cache the previous result and reuse it, so the next call to utf_length only has to measure the newly added slice of the line.

The code looks like this:

    Line::size_type last_to = from;
    int last_utf_len = 0;

    ... // Omitted

    while (to < line.length()) {
      
      ... // Omitted

      int utf_len = last_utf_len + line.utf_length(last_to, to);
      if (utf_len > w && lbp != (Area::size_type) -1)
                    { to = lbp; break; }

      last_to = to;
      last_utf_len = utf_len;
    }

The fix is simpler than it sounds.

Result

Now let’s profile the code with the same input again.

[Screenshot: KCacheGrind profile after the fix]

The call count of utf_length remains the same while the call count of utf8_aux_count decreases drastically. With the same 0.05MB input, the runtime goes from 15s to 0.02s.

Blaming

By blaming, I mean something like git blame. It turns out the patch for utf8 support, written in 2009, is to blame. This bug had gone undetected for years.

Submitting a patch

Now that I have fixed it, I feel like submitting a patch so that everyone can use a non-ridiculously-slow version of html2text, since utf8 is on by default.

I use Quilt to prepare the patch, and it makes me appreciate git more. I use reportbug to file it on the Debian package’s bug report page. It has been a few days since I submitted the report and the patch, with no replies yet. UPDATE 20180709: The patch has been merged!

Afterword

Since my patch is not accepted yet, I look around the bug report page and find another UTF-8 support patch, this one from 2014. The patch’s author attached the whole nmu.diff here. The patch was so large that the maintainer was reluctant to accept it. The thing is, that patch rewrote the UTF-8 support, and the new code doesn’t have the quadratic runtime problem. Too bad it has not been merged.

References

  1. html2text v1.3.2 project homepage
  2. html2text Debian version