Msgpack vs JSON (with gzip)

19 December 2017   12 comments   Python, Web development

Powered by Fusion×

tl;dr; I see no reason worth switching to Msgpack instead of good old JSON.

I was curious, how much more efficient is Msgpack at packing a bunch of data into a file I can emit from a web service.

In this experiment I take a massive JSON file that is used in a single-page-app I worked on. If I download the file locally as a .json file, the file is 2.1MB.

Converting it to Msgpack:

>>> import json, msgpack
>>> with open('events.json') as f:
...   events=json.load(f)
...
>>> len(events)
3
>>> events.keys()
dict_keys(['max_modified', 'events', 'urls'])
>>> with open('events.msgpack', 'wb') as f:
...   f.write(msgpack.packb(events))
...
1880266

events.json vs events.msgpack
Now, let's compared the two file formats, as seen on disk:

▶ ls -lh events*
-rw-r--r--  1 peterbe  wheel   2.1M Dec 19 10:16 events.json
-rw-r--r--  1 peterbe  wheel   1.8M Dec 19 10:19 events.msgpack

But! How well does it compress?

More common than not your web server can return content encoded in Gzip as content-encoding: gzip. So, let's compare that:

▶ gzip events.json ; gzip events.msgpack
▶ ls -l events*
-rw-r--r--  1 peterbe  wheel  304416 Dec 19 10:16 events.json.gz
-rw-r--r--  1 peterbe  wheel  305905 Dec 19 10:19 events.msgpack.gz

Msgpack vs JSON (with gzip)

Oh my! When you gzip the files the .json file ultimately becomes smaller. By a whopping 0.5%!

What about speed?

First let's open the files a bunch of times and see how long it takes to unpack:

def f1():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = json.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f2():
    with open('events.msgpack', 'rb') as f:
        s = f.read()
    t0 = time.time()
    events = msgpack.unpackb(s, encoding='utf-8')
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f3():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = ujson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0

(Note that the timing is around the json.loads() etc without measuring how long it takes to get the files to strings)

json.loads() vs msgpack.unpack() vs. ujson.loads()
Result (using Python 3.6.1): All about the same.

FUNCTION: f1 Used 56 times
    MEDIAN 30.509352684020996
    MEAN   31.09178798539298
    STDEV  3.5620914333233595
FUNCTION: f2 Used 68 times
    MEDIAN 27.882099151611328
    MEAN   28.704492484821994
    STDEV  3.353800228776872
FUNCTION: f3 Used 76 times
    MEDIAN 27.746915817260742
    MEAN   27.920340236864593
    STDEV  2.21554251130519

Same benchmark using PyPy 3.5.3, but skipping the f3() which uses ujson:

FUNCTION: f1 Used 99 times
    MEDIAN 20.905017852783203
    MEAN   22.13949386519615
    STDEV  5.142071370453135
FUNCTION: f2 Used 101 times
    MEDIAN 36.96393966674805
    MEAN   40.54664857316725
    STDEV  17.833577642246738

Dicussion and conclusion

One of the benefits of Msgpack is that it can used for streaming. "Streaming unpacking" as they call it. But, to be honest, I've never used it. That can useful when you have structured data trickling in and you don't want to wait for it all before using the data.

Another cool feature Msgpack has is ability to encode custom types. E.g. datetime.datetime. Like bson can do. With JSON you have to, for datetime objects do string conversions back and forth and the formats are never perfectly predictable so you kinda have to control both ends.

But beyond some feature differences, it seems that JSON compressed just as well as Msgpack when Gzipped. And unlike Msgpack JSON is not binary so it's easy to poke around with any tool. And decompressing JSON is just as fast. Almost. But if you need to squeeze out a couple of extra free milliseconds from your JSON files you can use ujson.

Conclusion; JSON is fine. It's bigger but if you're going to Gzip anyway, it's just as small as Msgpack.

Bonus! BSON

Another binary encoding format that supports custom types is BSON. This one is a pure Python implementation. BSON is used by MongoDB but this bson module is not what PyMongo uses.

Size comparison:

▶ ls -l events*son
-rw-r--r--  1 peterbe  wheel  2315798 Dec 19 11:07 events.bson
-rw-r--r--  1 peterbe  wheel  2171439 Dec 19 10:16 events.json

So it's 7% larger than JSON uncompressed.

▶ ls -l events*son.gz
-rw-r--r--  1 peterbe  wheel  341595 Dec 19 11:07 events.bson.gz
-rw-r--r--  1 peterbe  wheel  304416 Dec 19 10:16 events.json.gz

Meaning it's 12% fatter than JSON when Gzipped.

Doing a quick benchmark with this:

def f4():
    with open('events.bson', 'rb') as f:
        s = f.read()
    t0 = time.time()
    events = bson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0

Compared to the original f1() function:

FUNCTION: f1 Used 106 times
    MEDIAN 29.58393096923828
    MEAN   30.289863640407347
    STDEV  3.4766612593557173
FUNCTION: f4 Used 94 times
    MEDIAN 231.00042343139648
    MEAN   231.40889786659403
    STDEV  8.947746458066405

In other words, bson is about 600% slower than json.

This blog post was supposed to be about how well the individual formats size up against each other on disk but it certainly would be interesting to do a speed benchmark comparing Msgpack and JSON (and maybe BSON) where you have a bunch of datetimes or decimal.Decimal objects and see if the difference is favoring the binary formats.

Comments

Anonymous
What's wrong with XML? Actually definable data types and such.
Peter Bengtsson
You can define types in JSON too if really want to. XML and JSON is just one big string after all. It's all down to you how to de-serialize it. JSON has some built in "standards", if you can call it that. With XML you have define that yourself, in some specific way to your specific tools.
With JSON, an integer makes sense between a Python and Ruby program.

Also, XML is extremely verbose.
eugene skepner
I think BSON itself is fast, it was designed to be accessed directly, by copying it into dict you kinda misuse BSON. If for some reason you want to use standard dict to access data, then BSON is not a format to consider.
Arseniy Terekhin
It is interesting topic to me. I want to store neural network layers and right now deciding, should I support msgpack or not. If your events.json is mostly strings, than no wonder json ≈ msgpack. It's interesting how json and msgpack compares when the data is mostly arrays of floats.
Lahache Stéphane
There are also MsgPack official alternative: CBOR (http://cbor.io/)
It is unfortunate that this RFC so little known ...
Peter Bengtsson
There are plenty of non-strings in that events.json but, yes, it's generally mostly strings.
Peter Bengtsson
Does it have a good Python module?
Lahache Stéphane
pretty good yes. It's as fast as ujson, for the same size.
Just make "pip install cbor", and you will have the high-speed implementation (with C implementation)

For another python implementations (or other languages), see more here: http://cbor.io/impls.html

i quickly performed a bench, and it gaves me these stats:
##########
*Encoding*
json -> op by sec: 86725.66920628605, (len:828)
rapidjson -> op by sec: 180454.7353457841, (len:768)
ujson -> op by sec: 190558.97336411255, (len:776)
msgpack -> op by sec: 153912.86095711155, (len:636)
cbor -> op by sec: 207190.78744576086, (len:635)
##########
*Decoding*
json -> op by sec: 91466.83280991958
rapidjson -> op by sec: 110461.44729824932
ujson -> op by sec: 129011.48500404017
msgpack -> op by sec: 243895.1547813709
cbor -> op by sec: 168211.05628806681
Lahache Stéphane
* same size of MsgPack
Tim Caswell
I find msgpack really shines when you design your datasets and protocols around heavy use of arrays and integers. If your data is string heavy, there is little reason to use it. Also the elephant in the room, what about large binary values? It's annoying to encode Dates in JSON, but it's really annoying to have to encode large binary values as giant base64 strings.
A Jesse Jiryu Davis
Have you tried the official bson module that's included in PyMongo? PyMongo's bson is implemented in C, it's surely much faster than the pure-Python module you tried.
Peter Bengtsson
No, because I couldn't find it outside pymongo.
Thank you for posting a comment

Your email will never ever be published


Related posts

Previous:
Another win for Tracking Protection in Firefox 13 December 2017
Next:
CSS selector simplifier regular expression in JavaScript 20 December 2017
Related by Keyword:
Concurrent Gzip in Python 13 October 2017
Fastest *local* cache backend possible for Django 04 August 2017
Fastest Redis configuration for Django 11 May 2017
Fastest way to download a file from S3 29 March 2017
Cope with JSONDecodeError in requests.get().json() in Python 2 and 3 16 November 2016
Related by Text:
How to do performance micro benchmarks in Python 24 June 2017
Gzip rules the world of optimization, often 09 August 2014
Fastest way to find out if a file exists in S3 (with boto3) 16 June 2017
Fastest way to download a file from S3 29 March 2017
Fastest way to thousands-commafy large numbers in Python/PyPy 13 October 2012