Gzip rules the world of optimization, often

09 August 2014   4 comments   Python, Javascript


So I have a massive chunk of JSON that a Django view sends to a piece of AngularJS code that displays it nicely on the page. It's big: 674 KB, actually. And it's likely to get bigger in the near future. It's basically a list of dicts. Each one looks something like this:

>>> pprint(d['events'][0])
{u'archive_time': None,
 u'archive_url': u'/manage/events/archive/1113/',
 u'channels': [u'Main'],
 u'duplicate_url': u'/manage/events/duplicate/1113/',
 u'id': 1113,
 u'is_upcoming': True,
 u'location': u'Cyberspace - Pacific Time',
 u'modified': u'2014-08-06T22:04:11.727733+00:00',
 u'privacy': u'public',
 u'privacy_display': u'Public',
 u'slug': u'bugzilla-development-meeting-20141115',
 u'start_time': u'15 Nov 2014 02:00PM',
 u'start_time_iso': u'2014-11-15T14:00:00-08:00',
 u'status': u'scheduled',
 u'status_display': u'Scheduled',
 u'thumbnail': {u'height': 32,
                u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
                u'width': 32},
 u'title': u'Bugzilla Development Meeting'}

So I thought one hackish simplification would be to convert each of these dicts into a list with a known sort order. Something like this:

>>> event = d['events'][0]
>>> pprint([event[k] for k in sorted(event)])
[None,
 ...
 u'Cyberspace - Pacific Time',
 ...
 u'15 Nov 2014 02:00PM',
 ...
 {u'height': 32,
  u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
  u'width': 32},
 u'Bugzilla Development Meeting']

So I converted my sample events.json file like that:

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json

Excitingly, the file is now about 250 KB smaller because it no longer contains all those keys.
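For illustration, here is a minimal Python 3 sketch of that kind of conversion. The function name `pack_events` and the sample data are made up for this example; it is not the code that produced the file above, and it assumes every event has exactly the same keys.

```python
import json


def pack_events(events):
    """Turn a list of dicts into one shared key list plus a value list per event."""
    # Assumption: every event has the same keys as the first one.
    keys = sorted(events[0])
    rows = [[event[k] for k in keys] for event in events]
    return {"keys": keys, "events": rows}


# Made-up sample data, loosely shaped like the events above.
events = [
    {"id": i, "title": "Meeting %d" % i, "privacy": "public", "is_upcoming": True}
    for i in range(1000)
]

plain = json.dumps({"events": events})
packed = json.dumps(pack_events(events))

# The packed form is smaller because each key name is serialized only once.
print(len(plain), len(packed))
```

The key list travels with the payload so the client can rebuild the dicts in the same order.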

Now, I'd also send the order of the keys so I could do something like this in the AngularJS code:

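 .success(function(response) {
   var events = [];
   response.events.forEach(function(event) {
     var new_event = {};
     // response.keys lists the key names in the same order as each event's values
     response.keys.forEach(function(key, i) {
       new_event[key] = event[i];
     });
     events.push(new_event);
   });
 })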

Yuck! Nested loops! It was just getting more and more complicated.
Also, if some keys aren't present in every element, I'd have to fill in the missing ones with None.
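The missing-key problem is easy to see in a small round trip. This is only a sketch in Python 3 with hypothetical helper names (`pack_events`, `unpack_events`) and made-up data; the unpacking step mirrors what the client-side loop would have to do.

```python
def pack_events(events):
    """Pack dicts into a shared key list and per-event value lists."""
    # Collect every key seen in any event, in a known (sorted) order.
    keys = sorted(set().union(*events))
    # event.get(k) yields None for keys an event does not have.
    rows = [[event.get(k) for k in keys] for event in events]
    return {"keys": keys, "events": rows}


def unpack_events(payload):
    """Rebuild the list of dicts from the packed form."""
    keys = payload["keys"]
    return [dict(zip(keys, row)) for row in payload["events"]]


events = [
    {"id": 1, "title": "First", "archive_time": None},
    {"id": 2, "title": "Second"},  # no archive_time key at all
]

restored = unpack_events(pack_events(events))

# The second event now has archive_time set to None, so the round trip
# cannot tell "key missing" apart from "key present with a None value".
print(restored[1])
```

So on top of the extra loops, the packed format quietly changes the data whenever the dicts are not uniform.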

At this point I stopped. I could smell the hackish stink of sulfur rising from the hole I was digging myself into.
Then it occurred to me: gzip is really good at compressing repeated things, and a document-store-style data structure like a list of dicts has plenty of those.

So I packed them manually to see what we could get:

$ apack events.json.gz events.json
$ apack events.optimized.json.gz events.optimized.json

And without further ado...

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel    90K Aug  8 14:20 events.json.gz
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json
-rw-r--r--  1 peterbe  wheel    81K Aug  8 15:07 events.optimized.json.gz

Basically, all that complicated and slow hoopla for saving about 9 KB. No thank you.
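The same comparison can be done from within Python using the standard library's `gzip` module instead of a command-line tool. The sketch below (Python 3) uses made-up events and a hypothetical `pack_events` helper, so the exact numbers will differ from the files above; the point is only that the gap between the two variants shrinks a lot once both are gzipped.

```python
import gzip
import json


def pack_events(events):
    """Hypothetical helper: shared key list plus one value list per event."""
    keys = sorted(events[0])
    return {"keys": keys, "events": [[e[k] for k in keys] for e in events]}


# Made-up data with the same keys repeated in every event.
events = [
    {
        "id": i,
        "slug": "meeting-%d" % i,
        "title": "Meeting number %d" % i,
        "privacy": "public",
        "status": "scheduled",
        "archive_url": "/manage/events/archive/%d/" % i,
    }
    for i in range(2000)
]

plain = json.dumps({"events": events}).encode("utf-8")
packed = json.dumps(pack_events(events)).encode("utf-8")
plain_gz = gzip.compress(plain)
packed_gz = gzip.compress(packed)

for label, blob in [
    ("plain", plain),
    ("plain.gz", plain_gz),
    ("packed", packed),
    ("packed.gz", packed_gz),
]:
    print("%-10s %8d bytes" % (label, len(blob)))
```

In a real deployment the compression would normally be handled by the web server or a middleware rather than by hand; this is just a way to measure the effect.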

Thank you gzip for existing!


Anonymous Coward
In my experiments with obese JSON, I found that lzma beats gzip and bzip2 on extremely small or large data. Otherwise, bzip2 always beats gzip. Unfortunately, only gzip is the standard across web browsers, right?
Peter Bengtsson
Yes, gzip is the only standard that browsers use.
Hi, instead of using apack, you can use the zlib module in the standard library which would do the same from within python.
Stephen Chung
When you use GZip, all your "keys" are essentially free except for one copy. So the difference between your gzipped and gzipped-optimized files should be roughly the zipped size of all the keys, which you're going to send over the wire one way or another. In other words, once gzipped, the optimized version has no real benefit over the version with keys.
