tl;dr: I see no compelling reason to switch from good old JSON to Msgpack.
I was curious: how much more efficient is Msgpack at packing a bunch of data into a file I can emit from a web service?
In this experiment I take a massive JSON file that is used in a single-page app I worked on. Downloaded locally as a .json file, it's 2.1MB.
Converting it to Msgpack:
>>> import json, msgpack
>>> with open('events.json') as f:
...     events = json.load(f)
...
>>> len(events)
3
>>> events.keys()
dict_keys(['max_modified', 'events', 'urls'])
>>> with open('events.msgpack', 'wb') as f:
...     f.write(msgpack.packb(events))
...
1880266
▶ ls -lh events*
-rw-r--r-- 1 peterbe wheel 2.1M Dec 19 10:16 events.json
-rw-r--r-- 1 peterbe wheel 1.8M Dec 19 10:19 events.msgpack
More often than not, your web server can return content gzip-encoded with content-encoding: gzip. So, let's compare that:
▶ gzip events.json ; gzip events.msgpack
▶ ls -l events*
-rw-r--r-- 1 peterbe wheel 304416 Dec 19 10:16 events.json.gz
-rw-r--r-- 1 peterbe wheel 305905 Dec 19 10:19 events.msgpack.gz
Oh my! When you gzip the files, the .json file ultimately becomes the smaller one. By a whopping 0.5%!
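For what it's worth, you can reproduce that size comparison without leaving Python. This is just a sketch assuming the same events.json as above; note that gzip.compress defaults to compression level 9 while the gzip CLI defaults to 6, so the byte counts can differ slightly:

import gzip
import json

import msgpack

with open('events.json') as f:
    events = json.load(f)

raw_json = json.dumps(events).encode('utf-8')
raw_msgpack = msgpack.packb(events)

# sizes as a gzipping web server would actually send them
print(len(gzip.compress(raw_json)), len(gzip.compress(raw_msgpack)))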
First, let's open the files a bunch of times and see how long they take to unpack:
import json
import time

import msgpack
import ujson


def f1():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = json.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f2():
    with open('events.msgpack', 'rb') as f:
        s = f.read()
    t0 = time.time()
    # encoding='utf-8' is the pre-1.0 msgpack API; msgpack>=1.0 decodes
    # strings to str by default and this argument is gone
    events = msgpack.unpackb(s, encoding='utf-8')
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0


def f3():
    with open('events.json') as f:
        s = f.read()
    t0 = time.time()
    events = ujson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0
(Note that the timing wraps only the json.loads() call etc., without measuring how long it takes to read the files into strings.)
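The harness that produced the numbers below isn't shown in the post; here's a hypothetical stand-in that prints output in the same shape. I'm assuming the reported times are milliseconds (the functions return seconds, so the harness would multiply by 1000) and that the functions were interleaved at random:

import random
import statistics

def benchmark(functions, iterations=200):
    # Pick a function at random each round so background noise spreads
    # evenly, instead of running each function in one uninterrupted batch.
    timings = {fn.__name__: [] for fn in functions}
    for _ in range(iterations):
        fn = random.choice(functions)
        timings[fn.__name__].append(fn() * 1000)  # seconds -> milliseconds (assumed)
    for name, times in timings.items():
        print('FUNCTION:', name, 'Used', len(times), 'times')
        print('    MEDIAN', statistics.median(times))
        print('    MEAN  ', statistics.mean(times))
        print('    STDEV ', statistics.stdev(times))

benchmark([f1, f2, f3])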
FUNCTION: f1 Used 56 times
    MEDIAN 30.509352684020996
    MEAN   31.09178798539298
    STDEV  3.5620914333233595
FUNCTION: f2 Used 68 times
    MEDIAN 27.882099151611328
    MEAN   28.704492484821994
    STDEV  3.353800228776872
FUNCTION: f3 Used 76 times
    MEDIAN 27.746915817260742
    MEAN   27.920340236864593
    STDEV  2.21554251130519
Same benchmark using PyPy 3.5.3, but skipping f3(), which uses ujson:
FUNCTION: f1 Used 99 times
    MEDIAN 20.905017852783203
    MEAN   22.13949386519615
    STDEV  5.142071370453135
FUNCTION: f2 Used 101 times
    MEDIAN 36.96393966674805
    MEAN   40.54664857316725
    STDEV  17.833577642246738
One of the benefits of Msgpack is that it can be used for streaming. "Streaming unpacking" as they call it. But, to be honest, I've never used it. That can be useful when you have structured data trickling in and you don't want to wait for it all before using the data.
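Since I've never used it, take this with a grain of salt, but judging from the library's Unpacker API a minimal sketch might look like this. io.BytesIO stands in for a socket trickling data in; raw=False needs a reasonably recent msgpack (older versions used encoding='utf-8'):

import io

import msgpack

# Pack several events back-to-back, as a streaming producer would.
stream = io.BytesIO()
for event in ({'id': 1}, {'id': 2}, {'id': 3}):
    stream.write(msgpack.packb(event))
stream.seek(0)

unpacker = msgpack.Unpacker(raw=False)
while True:
    chunk = stream.read(4)  # tiny reads to simulate data trickling in
    if not chunk:
        break
    unpacker.feed(chunk)
    for obj in unpacker:  # yields each object as soon as it's complete
        print(obj)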
Another cool feature Msgpack has is the ability to encode custom types, e.g. datetime.datetime, much like BSON can. With JSON you have to do string conversions back and forth for datetime objects, and the formats are never perfectly predictable, so you kinda have to control both ends.
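For example, here's a minimal sketch using msgpack's default/object_hook hooks; the '__datetime__' marker key is just a convention I made up for this illustration, not part of the format:

import datetime

import msgpack

def encode_custom(obj):
    # msgpack calls this for any type it doesn't know how to pack
    if isinstance(obj, datetime.datetime):
        return {'__datetime__': obj.timestamp()}
    raise TypeError('Cannot serialize %r' % (obj,))

def decode_custom(obj):
    # called for every unpacked map, so we can reverse the encoding
    if '__datetime__' in obj:
        return datetime.datetime.fromtimestamp(obj['__datetime__'])
    return obj

packed = msgpack.packb(
    {'modified': datetime.datetime(2017, 12, 19, 10, 16)},
    default=encode_custom,
)
events = msgpack.unpackb(packed, object_hook=decode_custom, raw=False)
print(events['modified'])  # datetime.datetime(2017, 12, 19, 10, 16)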
But beyond some feature differences, it seems that JSON compresses just as well as Msgpack when gzipped. And unlike Msgpack, JSON is not binary, so it's easy to poke around in with any tool. And decoding JSON is just as fast. Almost. But if you need to squeeze a couple of extra free milliseconds out of your JSON parsing, you can use ujson.
Conclusion: JSON is fine. It's bigger, but if you're going to gzip anyway, it's just as small as Msgpack.
What about BSON, then? Converting the same data to BSON and comparing file sizes:

▶ ls -l events*son
-rw-r--r-- 1 peterbe wheel 2315798 Dec 19 11:07 events.bson
-rw-r--r-- 1 peterbe wheel 2171439 Dec 19 10:16 events.json
So it's 7% larger than JSON uncompressed.
▶ ls -l events*son.gz
-rw-r--r-- 1 peterbe wheel 341595 Dec 19 11:07 events.bson.gz
-rw-r--r-- 1 peterbe wheel 304416 Dec 19 10:16 events.json.gz
Meaning it's 12% fatter than JSON when Gzipped.
Doing a quick benchmark with this:
import bson  # assuming the standalone `bson` PyPI package, which provides loads()


def f4():
    with open('events.bson', 'rb') as f:
        s = f.read()
    t0 = time.time()
    events = bson.loads(s)
    t1 = time.time()
    assert len(events['events']) == 4365
    return t1 - t0
Compared to the original f1(), which uses plain json.loads():
FUNCTION: f1 Used 106 times
    MEDIAN 29.58393096923828
    MEAN   30.289863640407347
    STDEV  3.4766612593557173
FUNCTION: f4 Used 94 times
    MEDIAN 231.00042343139648
    MEAN   231.40889786659403
    STDEV  8.947746458066405
In other words, bson is about 600% slower than json.
This blog post was supposed to be about how well the individual formats size up against each other on disk, but it would certainly be interesting to do a speed benchmark comparing Msgpack and JSON (and maybe BSON) on data with a bunch of datetime or decimal.Decimal objects, to see if the difference favors the binary formats.