Debugging a Slow json.loads in Python
23 Nov 2019

Background
Profiling shows that pymongo’s bson.json_util.loads is consuming an unusual amount of CPU time.
Benchmark
To confirm that the function is slow, let's benchmark it against the standard library's json and simplejson.
import pyperf

def data():
    import json
    return json.dumps(['asdfasdf%s' % i for i in xrange(20)])

s = data()

runner = pyperf.Runner()
runner.timeit(name="json",
              stmt="json.loads(s)",
              setup="from __main__ import s; import json;")
runner.timeit(name="simplejson",
              stmt="simplejson.loads(s)",
              setup="from __main__ import s; import simplejson;")
runner.timeit(name="bson json_util",
              stmt="json_util.loads(s)",
              setup="from __main__ import s; from bson import json_util;")

Result:
.....................
json: Mean +- std dev: 5.63 us +- 0.18 us
.....................
simplejson: Mean +- std dev: 3.32 us +- 0.11 us
.....................
bson json_util: Mean +- std dev: 8.77 us +- 0.19 us

json_util is slow. Something is wrong.
Digging into the source
Profiling bson.json_util.loads shows that JSONDecoder.__init__ takes ~15% of the total time, yet this initializer is never called when profiling plain json.loads.
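That overhead is reproducible with the standard library alone. Here is a minimal sketch (Python 3 syntax; the original investigation was on Python 2.7) that uses a plain object_pairs_hook as a stand-in for the hook bson installs, and profiles repeated calls so JSONDecoder.__init__ surfaces in the stats:

```python
import cProfile
import io
import json
import pstats

# Stand-in for bson's hook: any non-default option pushes json.loads
# off its cached-decoder fast path.
def hooked_loads(s):
    return json.loads(s, object_pairs_hook=dict)

payload = json.dumps({'key%d' % i: i for i in range(20)})

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10000):
    hooked_loads(payload)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(15)
report = stream.getvalue()
print(report)  # JSONDecoder.__init__ appears alongside decode/raw_decode
```

The hook and iteration count here are illustrative; the point is that the decoder initializer shows up at all on a hot path that only parses JSON.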
Let’s look into the source code of Python 2.7 json.loads:
# json.loads
def loads(s, encoding=None, cls=None, object_hook=None, parse_float=None,
          parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """some docstring...
    """
    if (cls is None and encoding is None and object_hook is None and
            parse_int is None and parse_float is None and
            parse_constant is None and object_pairs_hook is None and not kw):
        return _default_decoder.decode(s)
    if cls is None:
        cls = JSONDecoder
    if object_hook is not None:
        kw['object_hook'] = object_hook
    if object_pairs_hook is not None:
        kw['object_pairs_hook'] = object_pairs_hook
    if parse_float is not None:
        kw['parse_float'] = parse_float
    if parse_int is not None:
        kw['parse_int'] = parse_int
    if parse_constant is not None:
        kw['parse_constant'] = parse_constant
    return cls(encoding=encoding, **kw).decode(s)

And look into bson.json_util.loads:
# bson.json_util.loads
def loads(s, *args, **kwargs):
    """some docstring
    """
    json_options = kwargs.pop("json_options", DEFAULT_JSON_OPTIONS)
    kwargs["object_pairs_hook"] = lambda pairs: object_pairs_hook(
        pairs, json_options)
    return json.loads(s, *args, **kwargs)

So json.loads does use a cached decoder, but only when every setting is left at its default. bson.json_util.loads always passes a non-default object_pairs_hook, so every call constructs a fresh JSONDecoder and throws it away immediately afterwards.
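The cost of that per-call construction is easy to observe with the standard library alone. A minimal sketch (Python 3 here; absolute timings are machine-dependent and merely illustrative):

```python
import json
import timeit

payload = json.dumps(['asdfasdf%s' % i for i in range(20)])

# Fast path: all defaults, so json.loads reuses the module-level
# _default_decoder on every call.
t_default = timeit.timeit(lambda: json.loads(payload), number=20000)

# Slow path: any non-default option (here object_pairs_hook) makes
# json.loads build and discard a fresh JSONDecoder on every call.
t_hooked = timeit.timeit(
    lambda: json.loads(payload, object_pairs_hook=dict), number=20000)

# Building the decoder once and reusing it restores the fast path.
decoder = json.JSONDecoder(object_pairs_hook=dict)
t_cached = timeit.timeit(lambda: decoder.decode(payload), number=20000)

print('default %.3fs  hooked %.3fs  cached %.3fs'
      % (t_default, t_hooked, t_cached))
```

On a typical run the hooked variant is noticeably slower than both the default and the pre-built decoder, mirroring the json-vs-json_util gap in the benchmark above.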
Solution
If json.loads is called frequently with the same non-default options, build the JSONDecoder once and reuse it.
The fix looks like this:
import json
from bson import json_util

_my_decoder = json.JSONDecoder(object_hook=json_util.object_hook)

def fixed_loads(s):
    return _my_decoder.decode(s)

And here is the benchmark:
bson_fix: Mean +- std dev: 5.38 us +- 0.22 us

Limitations
With this fix, the decode settings can no longer be changed per call to loads.
Conclusion
Caching the JSONDecoder object avoids wasteful per-call initialization and deallocation. The trade-off is clear: keeping one object alive buys roughly a 40% speedup (8.77 us down to 5.38 us), at the cost of fixed decode settings. Beware of that limitation.