Gzip rules the world of optimization, often

09 August 2014   4 comments   Python, Javascript

So I have a massive chunk of JSON that a Django view is sending to a piece of Angular that displays it nicely on the page. It's big: 674 KB, actually. And it's likely going to get bigger in the near future. It's basically a list of dicts. Each one looks something like this:

>>> pprint(d['events'][0])
{u'archive_time': None,
 u'archive_url': u'/manage/events/archive/1113/',
 u'channels': [u'Main'],
 u'duplicate_url': u'/manage/events/duplicate/1113/',
 u'id': 1113,
 u'is_upcoming': True,
 u'location': u'Cyberspace - Pacific Time',
 u'modified': u'2014-08-06T22:04:11.727733+00:00',
 u'privacy': u'public',
 u'privacy_display': u'Public',
 u'slug': u'bugzilla-development-meeting-20141115',
 u'start_time': u'15 Nov 2014 02:00PM',
 u'start_time_iso': u'2014-11-15T14:00:00-08:00',
 u'status': u'scheduled',
 u'status_display': u'Scheduled',
 u'thumbnail': {u'height': 32,
                u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
                u'width': 32},
 u'title': u'Bugzilla Development Meeting'}

So I thought one hackish simplification would be to convert each of these dicts into a list with a known sort order. Something like this:

>>> event = d['events'][0]
>>> pprint([event[k] for k in sorted(event)])
[None,
 u'/manage/events/archive/1113/',
 [u'Main'],
 u'/manage/events/duplicate/1113/',
 1113,
 True,
 u'Cyberspace - Pacific Time',
 u'2014-08-06T22:04:11.727733+00:00',
 u'public',
 u'Public',
 u'bugzilla-development-meeting-20141115',
 u'15 Nov 2014 02:00PM',
 u'2014-11-15T14:00:00-08:00',
 u'scheduled',
 u'Scheduled',
 {u'height': 32,
  u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
  u'width': 32},
 u'Bugzilla Development Meeting']

So I converted my sample events.json file like that:

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json

Excitingly, the file is now about 250 KB smaller (674 KB down to 423 KB) because it no longer repeats all those keys in every record.

Now, I'd also send the order of the keys so I could do something like this in the AngularJS code:

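 .success(function(response) {
   // response.keys is the sorted list of key names, sent once
   events = []
   response.events.forEach(function(event) {
     var new_event = {}
     response.keys.forEach(function(key, i) {
       // event[i] is the value that belongs to keys[i]
       new_event[key] = event[i]
     })
     events.push(new_event)
   })
 })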

Yuck! Nested loops! It was just getting more and more complicated.
Also, if there are keys that are not present in every element, it means I'd have to replace them with None.
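
For the record, here is roughly what both halves of that scheme would look like in Python 3. This is only a sketch: the `pack` and `unpack` names, the `keys`/`events` payload layout and the two sample events are made up for illustration, not what the real view did.

```python
def pack(events):
    """Turn a list of dicts into a key list plus rows in that key order."""
    # Union of every key seen, sorted so both sides agree on the order.
    keys = sorted({key for event in events for key in event})
    # dict.get() returns None for a key that an event does not have.
    rows = [[event.get(key) for key in keys] for event in events]
    return {"keys": keys, "events": rows}


def unpack(payload):
    """Rebuild the list of dicts from the packed form."""
    keys = payload["keys"]
    return [dict(zip(keys, row)) for row in payload["events"]]


# Hypothetical sample data; the second event lacks the "privacy" key.
events = [
    {"id": 1113, "title": "Bugzilla Development Meeting", "privacy": "public"},
    {"id": 1114, "title": "Another Meeting"},
]
packed = pack(events)
restored = unpack(packed)
```

Note that the round trip is not lossless: an event that never had a key comes back with that key set to `None`, which is exactly the kind of wrinkle that makes this approach feel hackish.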

At this point I stopped and I could smell the hackish stink of sulfur of the hole I was digging myself into.
Then it occurred to me: gzip is really good at compressing repeated things, and a list of dicts, being a document-store type of data structure, has plenty of them.

So I packed them manually to see what we could get:

$ apack events.json.gz events.json
$ apack events.optimized.json.gz events.optimized.json

And without further ado...

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel    90K Aug  8 14:20 events.json.gz
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json
-rw-r--r--  1 peterbe  wheel    81K Aug  8 15:07 events.optimized.json.gz

Basically, all that complicated and slow hoopla for saving about 9 KB (90 KB versus 81 KB). No thank you.
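
The same comparison can be done without leaving Python, using the `gzip` and `json` modules from the standard library. The sketch below (Python 3) uses a synthetic list of 500 made-up events rather than the real events.json, so the numbers it prints will not match the ones above; the point is only to compare the size gap before and after gzipping.

```python
import gzip
import json

# Synthetic stand-in for the real data: same keys repeated in every record.
events = [
    {
        "id": i,
        "slug": "meeting-%d" % i,
        "title": "Meeting number %d" % i,
        "privacy": "public",
        "status": "scheduled",
        "archive_url": "/manage/events/archive/%d/" % i,
    }
    for i in range(500)
]

# The "optimized" form: key names sent once, values as ordered lists.
keys = sorted(events[0])
rows = [[event[key] for key in keys] for event in events]


def sizes(obj):
    """Return (raw byte count, gzipped byte count) of obj serialized as JSON."""
    raw = json.dumps(obj).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


full_raw, full_gz = sizes({"events": events})
rows_raw, rows_gz = sizes({"keys": keys, "events": rows})

print("with keys:    %d bytes raw, %d bytes gzipped" % (full_raw, full_gz))
print("without keys: %d bytes raw, %d bytes gzipped" % (rows_raw, rows_gz))
print("saved raw: %d, saved gzipped: %d" % (full_raw - rows_raw, full_gz - rows_gz))
```

On repetitive data like this, the bytes saved by dropping the keys should shrink a lot once both versions are gzipped, which is the same pattern as in the file listing above.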

Thank you gzip for existing!

Comments

Anonymous Coward
In my experiments with obese JSON, I found that lzma beats gzip and bzip2 on extremely small or large data. Otherwise, bzip2 always beats gzip. Unfortunately, only gzip is the standard across web browsers, right?
Peter Bengtsson
Yes, gzip is the only standard that browsers use.
gg
Hi, instead of using apack, you can use the zlib module in the standard library which would do the same from within python.
Stephen Chung
When you use GZip, all your "keys" are essentially free except for one copy. So your gzipped vs gzipped optimized should be the zipped sizes of all the keys, which you're going to send over the wire in one way or another. In other words, the optimized version has no benefit over the version with keys when gzipped.
