Debugging Python's Slow json.loads
23 Nov 2019

Background
Profiling shows that pymongo's bson.json_util.loads is consuming an unusual amount of CPU time.
Benchmark
To confirm the function is slow, let's benchmark it.
import pyperf

def data():
    import json
    return json.dumps(['asdfasdf%s' % i for i in xrange(20)])

s = data()
runner = pyperf.Runner()
runner.timeit(name="json",
              stmt="json.loads(s)",
              setup="from __main__ import s; import json")
runner.timeit(name="simplejson",
              stmt="simplejson.loads(s)",
              setup="from __main__ import s; import simplejson")
runner.timeit(name="bson json_util",
              stmt="json_util.loads(s)",
              setup="from __main__ import s; from bson import json_util")
Result:
.....................
json: Mean +- std dev: 5.63 us +- 0.18 us
.....................
simplejson: Mean +- std dev: 3.32 us +- 0.11 us
.....................
bson json_util: Mean +- std dev: 8.77 us +- 0.19 us
json_util is slow. Something is wrong.
Digging into the source
Profiling bson.json_util.loads shows that JSONDecoder.__init__ takes ~15% of the time. In contrast, it is not called at all in json.loads.

Let's look at the source code of Python 2.7's json.loads:
# json.loads
def loads(s, encoding=None, cls=None, object_hook=None, parse_float=None,
          parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """some docstring...
    """
    if (cls is None and encoding is None and object_hook is None and
            parse_int is None and parse_float is None and
            parse_constant is None and object_pairs_hook is None and not kw):
        return _default_decoder.decode(s)
    if cls is None:
        cls = JSONDecoder
    if object_hook is not None:
        kw['object_hook'] = object_hook
    if object_pairs_hook is not None:
        kw['object_pairs_hook'] = object_pairs_hook
    if parse_float is not None:
        kw['parse_float'] = parse_float
    if parse_int is not None:
        kw['parse_int'] = parse_int
    if parse_constant is not None:
        kw['parse_constant'] = parse_constant
    return cls(encoding=encoding, **kw).decode(s)
And look into bson.json_util.loads:
# bson.json_util.loads
def loads(s, *args, **kwargs):
    """some docstring
    """
    json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
    kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(
        pairs, json_options)
    return json.loads(s, *args, **kwargs)
json.loads actually reuses a cached decoder (_default_decoder) when all settings are default. But bson.json_util.loads always passes object_pairs_hook, so it never hits that fast path: it creates a new JSONDecoder object and throws it away on every loads call.
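The fast/slow split is visible from the public API alone. A minimal sketch using only the standard-library json module (the same cached _default_decoder fast path also exists in Python 3):

```python
import json

doc = '{"a": 1, "b": 2}'

# Fast path: no options given, so json.loads reuses the module-level
# _default_decoder instead of constructing a JSONDecoder.
plain = json.loads(doc)

# Slow path: passing any hook (as bson.json_util always does with
# object_pairs_hook) forces cls(**kw).decode(s), i.e. a fresh
# JSONDecoder is built and discarded on every call.
hooked = json.loads(doc, object_pairs_hook=dict)
```

Both calls return the same dict; only the per-call construction cost differs.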
Solution
If json.loads is called frequently with non-default options, cache the JSONDecoder for performance.
The fix looks like this:
import json
from bson import json_util

_my_decoder = json.JSONDecoder(object_hook=json_util.object_hook)

def fixed_loads(s):
    return _my_decoder.decode(s)
And here is the benchmark:
bson_fix: Mean +- std dev: 5.38 us +- 0.22 us
Limitations
With the fix, it is not possible to change the decode settings on demand when calling loads.
Conclusion
Cache the JSONDecoder object to avoid wasteful per-call initialization and deallocation. The tradeoff is clear: caching one object buys a ~40% speedup. Beware of the limitation.