Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page! Currently only showing blog entries under the category: Python. Clear filter

In action
A couple of weeks ago we had accidentally broken our production server (for a particular report) because of broken HTML. It was an unclosed tag which rendered everything after that tag to just plain white. Our comprehensive test suite failed to notice it because it didn't look at details like that. And when it was tested manually we simply missed the conditional situation when it was caused. Neither good excuses. So it got me thinking how can we incorporate HTML (html5 in particular) validation into our test suite.

So I wrote a little gist and used it a bit on a couple of projects and was quite pleased with the results. But I thought this might be something worthwhile to keep around for future projects or for other people who can't just copy-n-paste a gist.

With that in mind I put together a little package with a README and a setup.py and now you can use it too.

There are however some caveats. Especially if you intend to run it as part of your test suite.

Caveat number 1

You can't flood htmlvalidator.nu. Well, you can I guess. It would be really evil of you and kittens will die. If you have a test suite that does things like response = self.client.get(reverse('myapp:myview')) and there are many tests you might be causing an obscene amount of HTTP traffic to them. Which brings us on to...

Caveat number 2

The htmlvalidator.nu site is written in Java and it's open source. You can basically download their validator and point django-html-validator to it locally. Basically the way it works is java -jar vnu.jar myfile.html. However, it's slow. Like really slow. It takes about 2 seconds to run just one modest HTML file. So, you need to be patient.

Premailer is probably my most successful open source project in recent years. I base that on the fact that 25 different people have committed to it.

Today I merged a monster PR by Michael Jason Smith of OnlineGroups.net.

What it does is basically that it makes premailer work in Python 3, PyPy and Python 2.6. Check out the tox.ini file. Test coverage is still 100%.

If you look at the patch the core of the change is actually surprisingly little. The majority of the "secret sauce" is basically a bunch of import statements which are split by if sys.version_info >= (3, ): and some various minor changes around encoding UTF-8. The rest of the changes are basically test sit-ups.

A really interesting thing that hit us was that the code had assumptions about the order of things. Basically the tests assumed the the order of certain things in the resulting output was predictable even though it was done using a dict. dicts are famously unreliable in terms of the order you get things out and it's meant to be like that and it's a design choice. The reason it worked till now is not only luck but quite amazing.

Anyway, check it out. Now that we have a tox.ini file it should become much easier to run tests which I hope means patches will be better checked as they come in.

One of my most popular GitHub Open Source projects is premailer. It's a python library for combining HTML and CSS into HTML with all its CSS inlined into tags. This is a useful and necessary technique when sending HTML emails because you can't send those with an external CSS file (or even a CSS style tag in many cases).

The project has had 23 contributors so far and as always people come in get some itch they have scratched and then leave. I really try to get good test coverage and when people come with code I almost always require that it should come with tests too.

But sometimes you miss things. Also, this project was born as a weekend hack that slowly morphed into an actual package and its own repository and I bet there was code from that day that was never fully test covered.

So today I combed through the code and plugged all the holes where there wasn't test coverage.
Also, I set up Coveralls (project page) which is an awesome service that hooks itself up with Travis CI so that on every build and every Pull Request, the tests are run with --with-cover on nosetests and that output is reported to Coveralls.

The relevant changes you need to do are:

1) You need to go to coveralls.io (sign in with your GitHub account) and add the repo.
2) Edit your .travis.yml file to contain the following:

before_install:
    - pip install coverage
...
after_success:
    - pip install coveralls
    - coveralls

And you need to execute your tests so that coverage is calculated (the coverage module stores everything in a .coverage file which coveralls analyzes and sends). So in my case I change to this:

script:
    - nosetests premailer --with-cover --cover-erase --cover-package=premailer

3) You must also give coveralls some clues. So it reports on only the relevant files. Here's what mine looked like:

[run]
source = premailer

[report]
omit = premailer/test*

Now, I get to have a cute "coverage: 100%" badge in the README and when people post pull requests Coveralls will post a comment to reflect how the pull request changes the test coverage.

I am so grateful for all these wonderful tools. And it's all free too!

So I have a massive chunk of JSON that a Django view is sending to a piece of Angular that displays it nicely on the page. It's big. 674Kb actually. And it's likely going to be bigger in the near future. It's basically a list of dicts. It looks something like this:

>>> pprint(d['events'][0])
{u'archive_time': None,
 u'archive_url': u'/manage/events/archive/1113/',
 u'channels': [u'Main'],
 u'duplicate_url': u'/manage/events/duplicate/1113/',
 u'id': 1113,
 u'is_upcoming': True,
 u'location': u'Cyberspace - Pacific Time',
 u'modified': u'2014-08-06T22:04:11.727733+00:00',
 u'privacy': u'public',
 u'privacy_display': u'Public',
 u'slug': u'bugzilla-development-meeting-20141115',
 u'start_time': u'15 Nov 2014 02:00PM',
 u'start_time_iso': u'2014-11-15T14:00:00-08:00',
 u'status': u'scheduled',
 u'status_display': u'Scheduled',
 u'thumbnail': {u'height': 32,
                u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
                u'width': 32},
 u'title': u'Bugzilla Development Meeting'}

So I thought one hackish simplification would be to convert each of these dicts into an list with a known sort order. Something like this:

>>> event = d['events'][0]
>>> pprint([event[k] for k in sorted(event)])
[None,
 u'/manage/events/archive/1113/',
 [u'Main'],
 u'/manage/events/duplicate/1113/',
 1113,
 True,
 u'Cyberspace - Pacific Time',
 u'2014-08-06T22:04:11.727733+00:00',
 u'public',
 u'Public',
 u'bugzilla-development-meeting-20141115',
 u'15 Nov 2014 02:00PM',
 u'2014-11-15T14:00:00-08:00',
 u'scheduled',
 u'Scheduled',
 {u'height': 32,
  u'url': u'/media/cache/e7/1a/e71a58099a0b4cf1621ef3a9fe5ba121.png',
  u'width': 32},
 u'Bugzilla Development Meeting']

So I converted my sample events.json file like that:

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json

Excitingly the file is now 250Kb smaller because it no longer contains all those keys.

Now, I'd also send the order of the keys so I could do something like this in the AngularJS code:

 .success(function(response) {
   events = []
   response.events.forEach(function(event) {
     var new_event = {}
     response.keys.forEach(function(key, i) {
       new_event[k] = event[i]
     })
   })
 })

Yuck! Nested loops! It was just getting more and more complicated.
Also, if there are keys that are not present in every element, it means I'd have to replace them with None.

At this point I stopped and I could smell the hackish stink of sulfur of the hole I was digging myself into.
Then it occurred to me, gzip is really good at compressing repeated things which is something we have plenty of in a document store type data structure that a list of dicts is.

So I packed them manually to see what we could get:

$ apack events.json.gz events.json
$ apack events.optimized.json.gz events.optimized.json

And without further ado...

$ l -h events*
-rw-r--r--  1 peterbe  wheel   674K Aug  8 14:08 events.json
-rw-r--r--  1 peterbe  wheel    90K Aug  8 14:20 events.json.gz
-rw-r--r--  1 peterbe  wheel   423K Aug  8 15:06 events.optimized.json
-rw-r--r--  1 peterbe  wheel    81K Aug  8 15:07 events.optimized.json.gz

Basically, all that complicated and slow hoopla for saving 10Kb. No thank you.

Thank you gzip for existing!

Yesterday I had the good fortune to present Crontabber to the The San Francisco Bay Area PostgreSQL Meetup Group organized by my friend Josh Berkus.

To spare you having to watch the whole presentation I'm going to summarize some of it here.

My colleague Lars also has a blog post about Crontabber and goes into a bit more details about the nitty gritty of using PostgreSQL.

What is crontabber?

It's a tool for running cron jobs. It's written in Python and PostgreSQL and it's a tool we need for Socorro which has lots and lots of stored procedures and cron jobs.

So it's basically a script that gets started by UNIX crontab and runs every 5 minutes. Internally it keeps an index of all the apps it needs to run and it manages dependencies between jobs and is self-healing meaning that if something goes wrong during run-time it just retries again and again until it works. Amongst many of the juicy features it offers on top of regular "cron wrappers" is something we call "backfilling".

Backfill jobs are jobs that happen on a periodic basis and get given a date. If all is working this date is the same as "now()" but if something was to go wrong it remembers all the dates that did not work and re-attempts with those exact dates. That means that you can guarantee that the job gets started with every date possible even if it needs to catch up on past dates.

There's plenty of documentation on how to install and create jobs but it all starts with a simple pip install crontabber.

To get a feel for how you write crontabber "apps", checkout the ones for Socorro or flick through the slides in the PDF.

Is it mature?

Yes! It all started in early 2012 as a part of the Socorro code base and after some hard months of it stabalizing and maturing we decided to extract it out of Socorro and into its own project on GitHub under the Mozilla Public Licence 2.0 licence. Now it stands on its own legs and has no longer anything to do with Socorro and can be used for anything and anyone who has a lot of complicated cron jobs that need to run in Python with a PostgreSQL connection. In Socorro we use it primarily for executing stored procedures but that's just one type of app. You can also make it call out on to anything the command line with a "@with_subprocess" decorator.

Is it finished?

No. It works really well for us but there's a decent list of features we want to add. The hope is that by open sourcing it we can get other organizations to adopt it and not only find bugs but also contribute code to add more juicy features.

One of the soon-to-come features we have in mind is to "internalize locking". At the moment you have to wrap it in a bash script that prevents it from being run concurrently. Crontabber is single-threaded and we don't have to worry about "dependent jobs" starting before "parent jobs" because the depdendencies and their state is stored in the database. But you might need to worry about the same job (the one due next) to be started concurrently. By internalizing the locking we can store, in the state database, that a particular job is being started on and thus not have to worry about it starting the same job twice.

I really hope this project can grow and continue to support us in our needs.

grymt is a python tool that takes a directory full of .html, .css and .js and prepares the html for optimial production use.

For a teaser:

  1. Look at the "input"

  2. Look at the "output" (Note! You have to right-click and view source)

So why did I write my own tool and not use Grunt?!

Glad you asked! The reason is simple: I couldn't get Grunt to work.

Grunt is a framework. It's a place where you say which "recipes" to execute and how. It's effectively a common config framework. Like make.
However, I tried to set up a bunch of recipes in my Gruntfile.js and most of them worked well individually but it was a hellish nightmare to get it all to work together just the way I want it.

For example, the grunt-contrib-uglify is fine for doing the minification but it doesn't work with concatenation and it doesn't deal with taking one input file and outputting to a different file.
Basically, I spent two evenings getting things to work but I could never get exactly what I wanted. So I wrote my own and because I'm quite familiar with this kind of stuff, I did it in Python. Not because it's better than Node but just because I had it near by and was able to quicker build something.

So what sweet features do you get out of grymt?

  1. You can easily make an output file have a hash in the filename. E.g. vendor-$hash.min.js becomes vendor-64f7425.min.js and thus the filename is always unique but doesn't change in between deployments unless you change the files.

  2. It automatically notices which files already have been minified. E.g. no need to minify somelib.min.js but do minify otherlib.js.

  3. You can put $git_revision anywhere in your HTML and this gets expanded automatically. For example, view the source of buggy.peterbe.com and look at the first 20 lines.

  4. Images inside CSS get rewritten to have unique names (based on files' modified time) so they can be far-future cached aggresively too.

  5. You never have to write down any lists of file names in soome Gruntfile.js equivalent file

  6. It copies ALL files from a source directory. This is important in case you have something like this inside your javascript code: $('<img>').attr('src', 'picture.jpg') for example.

  7. You can chose to inline all the minified and concatenated CSS or javascript. Inlining CSS is neat for single page apps where you have a majority of primed cache hits. Instead of one .html and one .css you get just one .html and the amount of bytes is the same. Not having to do another HTTP request can save a lot of time on web performance.

  8. The generated (aka. "dist" directory) contains everything you need. It does not refer back to the source directory in any way. This means you can set up your apache/nginx to point directly at the root of your "dist" directory.

So what's the catch?

  1. It's not Grunt. It's not a framework. It does only what it does and if you want it to do more you have to work on grymt itself.

  2. The files you want to analyze, process and output all have to be in a sub directory.
    Look at how I've laid out the files here in this project for example. ALL files that you need is all in one sub-directory called app. So, to run grymt I simply run: grymt app.

  3. The HTML files you throw into it have to be plain HTML files. No templates for server-side code.

How do you use it?

pip install grymt

Then you need a directory it can process, e.g ./client/ (assumed to contain a .html file(s)).

grymt ./client

For more options, check out

grymt --help

What's in the future of grymt?

If people like it and want to add features, I'm more than happy to accept pull requests. Some future potential feature work:

  • I haven't needed it immediately, yet, myself, but it would be nice to add things like coffeescript, less, sass etc into pre-processing hooks.

  • It would be easy to automatically generate and insert a reference to a appcache manifest. Since every file used and mentioned is noticed, we could very accurately generate an appcache file that is less prone to human error.

  • Spitting out some stats about number bytes saved and number of files reduced.

My friend and colleague Jannis (aka jezdez) Leidel saved my bacon today where I had gotten completely stuck.

So, I have this python2.6 virtualenv and whenever I ran python setup.py sdist upload it would upload a really nasty tarball to PyPI. What would happen is that when people do pip install premailer it would file horribly and look something like this:

...
IOError: [Errno 2] No such file or directory: '/path/to/virtual-env/build/premailer/setup.py'

What?!?! If you download the tarball and unpack it you'll see that there definitely is a setup.py file in there.

Anyway. What happens, which I didn't realize was that within the .tar.gz file there were these strange copies of files. For example for every file.py there was a ._file.py etc.

Here's what the file looked like after a tarball had been created:

(premailer26)peterbe@mpb:~/dev/PYTHON/premailer (master)$ tar -zvtf dist/premailer-2.0.2.tar.gz
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 ./._premailer-2.0.2
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/
-rw-r--r--  0 peterbe staff     280 Mar 28 10:13 premailer-2.0.2/._LICENSE
-rw-r--r--  0 peterbe staff    1517 Mar 28 10:13 premailer-2.0.2/LICENSE
-rw-r--r--  0 peterbe staff     280 Apr  9 21:10 premailer-2.0.2/._MANIFEST.in
-rw-r--r--  0 peterbe staff      34 Apr  9 21:10 premailer-2.0.2/MANIFEST.in
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/._PKG-INFO
-rw-r--r--  0 peterbe staff    7226 Apr 11 15:51 premailer-2.0.2/PKG-INFO
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 premailer-2.0.2/._premailer
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/premailer/
-rwxr-xr-x  0 peterbe staff     311 Apr 11 15:51 premailer-2.0.2/._premailer.egg-info
drwxr-xr-x  0 peterbe staff       0 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/
-rw-r--r--  0 peterbe staff     280 Mar 28 10:13 premailer-2.0.2/._README.md
-rw-r--r--  0 peterbe staff    5185 Mar 28 10:13 premailer-2.0.2/README.md
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/._setup.cfg
-rw-r--r--  0 peterbe staff      59 Apr 11 15:51 premailer-2.0.2/setup.cfg
-rw-r--r--  0 peterbe staff     280 Apr  9 21:09 premailer-2.0.2/._setup.py
-rw-r--r--  0 peterbe staff    2079 Apr  9 21:09 premailer-2.0.2/setup.py
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._dependency_links.txt
-rw-r--r--  0 peterbe staff       1 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/dependency_links.txt
-rw-r--r--  0 peterbe staff     280 Apr  9 21:04 premailer-2.0.2/premailer.egg-info/._not-zip-safe
-rw-r--r--  0 peterbe staff       1 Apr  9 21:04 premailer-2.0.2/premailer.egg-info/not-zip-safe
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._PKG-INFO
-rw-r--r--  0 peterbe staff    7226 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/PKG-INFO
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._requires.txt
-rw-r--r--  0 peterbe staff      23 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/requires.txt
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._SOURCES.txt
-rw-r--r--  0 peterbe staff     329 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/SOURCES.txt
-rw-r--r--  0 peterbe staff     280 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/._top_level.txt
-rw-r--r--  0 peterbe staff      10 Apr 11 15:51 premailer-2.0.2/premailer.egg-info/top_level.txt
-rw-r--r--  0 peterbe staff     280 Apr  9 21:21 premailer-2.0.2/premailer/.___init__.py
-rw-r--r--  0 peterbe staff      66 Apr  9 21:21 premailer-2.0.2/premailer/__init__.py
-rw-r--r--  0 peterbe staff     280 Apr  9 09:23 premailer-2.0.2/premailer/.___main__.py
-rw-r--r--  0 peterbe staff    3315 Apr  9 09:23 premailer-2.0.2/premailer/__main__.py
-rw-r--r--  0 peterbe staff     280 Apr  8 16:22 premailer-2.0.2/premailer/._premailer.py
-rw-r--r--  0 peterbe staff   15368 Apr  8 16:22 premailer-2.0.2/premailer/premailer.py
-rw-r--r--  0 peterbe staff     280 Apr  8 16:22 premailer-2.0.2/premailer/._test_premailer.py
-rw-r--r--  0 peterbe staff   37184 Apr  8 16:22 premailer-2.0.2/premailer/test_premailer.py

Strangly, this only happened in a Python 2.6 environment. The problem went away when I created a brand new Python 2.7 enviroment with the latest setuptools.

So basically, the fault lies with OSX and a strange interaction between OSX and tar.
This superuser.com answer does a much better job explaining this "flaw".

So, the solution to the problem is to create the distribution like this instead:

$ COPYFILE_DISABLE=true python setup.py sdist

If you do that, you get a healthy lookin tarball that actually works to pip install. Thanks jezdez for pointing that out!

Because this bit me harder than I was ready for, I thought I'd make a note of it for the next victim.

In Python 2, suppose you have this:

Python 2.7.5
>>> items = [(1, 'A number'), ('a', 'A letter'), (2, 'Another number')]

Sorting them, without specifying how, will automatically notice that it contains tuples:

Python 2.7.5
>>> sorted(items)
[(1, 'A number'), (2, 'Another number'), ('a', 'A letter')]

This doesn't work in Python 3 because comparing integers and strings is not allowed. E.g.:

Python 3.3.3
>>> 1 < '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()

You have to convert them to stings first.

Python 3.3.3
>>> sorted(items, key=lambda x: str(x[0]))
[(1, 'A number'), (2, 'Another number'), ('a', 'A letter')]

If you really need to sort by 1 < '1' this won't work. Then you need a more complex key function. E.g.:

Python 3.3.3
>>> def keyfunction(x):
...   v = x[0]
...   if isinstance(v, int): v = '0%d' % v
...   return v
...
>>> sorted(items, key=keyfunction)
[(1, 'A number'), (2, 'Another number'), ('1', 'Actually a string')]

That's really messy but the best I can come up with at past 4PM on Friday.

No, this is not about the new JSON Type added in Postgres 9.2. This is about how you can get a record set from a Postgres database into a JSON string the best way possible using Python.

Here's the traditional way:

>>> import json
>>> import psycopg2
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor()
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>> columns = (
...     'id', 'oid', 'root', 'approved', 'name'
... )
>>> results = []
>>> for row in cur.fetchall():
...     results.append(dict(zip(columns, row)))
...
>>> print json.dumps(results, indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
...

This is plain and nice but it's kinda annoying that you have to write down the columns you're selecting twice.
Also, it's annoying that you have to convert the results of fetchall() into a list of dicts in an extra loop.

So, there's a trick to the rescue! You can use the cursor_factory parameter. See below:

>>> import json
>>> import psycopg2
>>> from psycopg2.extras import RealDictCursor
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor(cursor_factory=RealDictCursor)
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>>
>>> print json.dumps(cur.fetchall(), indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
...

Isn't that much nicer? It's shorter and only lists the columns once.

But is it much faster? Sadly, no it's not. Not much faster. I ran various benchmarks comparing various ways of doing this and basically concluded that there's no significant difference. The latter one using RealDictCursor is around 5% faster. But I suspect all the time in the benchmark is spent doing things (the I/O) that is not different between the various versions.

Anyway. It's a keeper. I think it just looks nicer.

When you use a web framework like Tornado, which is single threaded with an event loop (like nodejs familiar with that), and you need persistency (ie. a database) there is one important questions you need to ask yourself:

Is the query fast enough that I don't need to do it asynchronously?

If it's going to be a really fast query (for example, selecting a small recordset by key (which is indexed)) it'll be quicker to just do it in a blocking fashion. It means less CPU work to jump between the events.

However, if the query is going to be potentially slow (like a complex and data intensive report) it's better to execute the query asynchronously, do something else and continue once the database gets back a result. If you don't all other requests to your web server might time out.

Another important question whenever you work with a database is:

Would it be a disaster if you intend to store something that ends up not getting stored on disk?

This question is related to the D in ACID and doesn't have anything specific to do with Tornado. However, the reason you're using Tornado is probably because it's much more performant that more convenient alternatives like Django. So, if performance is so important, is durable writes important too?

Let's cut to the chase... I wanted to see how different databases perform when integrating them in Tornado. But let's not just look at different databases, let's also evaluate different ways of using them; either blocking or non-blocking.

What the benchmark does is:

  • On one single Python process...
  • For each database engine...
  • Create X records of something containing a string, a datetime, a list and a floating point number...
  • Edit each of these records which will require a fetch and an update...
  • Delete each of these records...

I can vary the number of records ("X") and sum the total wall clock time it takes for each database engine to complete all of these tasks. That way you get an insert, a select, an update and a delete. Realistically, it's likely you'll get a lot more selects than any of the other operations.

And the winner is:

pymongo!! Using the blocking version without doing safe writes.

Fastest database for Tornado

Let me explain some of those engines

  • pymongo is the blocking pure python engine
  • with the redis, toredis and memcache a document ID is generated with uuid4, converted to JSON and stored as a key
  • toredis is a redis wrapper for Tornado
  • when it says (safe) on the engine it means to tell MongoDB to not respond until it has with some confidence written the data
  • motor is an asynchronous MongoDB driver specifically for Tornado
  • MySQL doesn't support arrays (unlike PostgreSQL) so instead the tags field is stored as text and transformed back and fro as JSON
  • None of these database have been tuned for performance. They're all fresh out-of-the-box installs on OSX with homebrew
  • None of these database have indexes apart from ElasticSearch where all things are indexes
  • momoko is an awesome wrapper for psycopg2 which works asyncronously specifically with Tornado
  • memcache is not persistant but I wanted to include it as a reference
  • All JSON encoding and decoding is done using ultrajson which should work to memcache, redis, toredis and mysql's advantage.
  • mongokit is a thin wrapper on pymongo that makes it feel more like an ORM
  • A lot of these can be optimized by doing bulk operations but I don't think that's fair
  • I don't yet have a way of measuring memory usage for each driver+engine but that's not really what this blog post is about
  • I'd love to do more work on running these benchmarks on concurrent hits to the server. However, with blocking drivers what would happen is that each request (other than the first one) would have to sit there and wait so the user experience would be poor but it wouldn't be any faster in total time.
  • I use the official elasticsearch driver but am curious to also add Tornado-es some day which will do asynchronous HTTP calls over to ES.

You can run the benchmark yourself

The code is here on github. The following steps should work:

$ virtualenv fastestdb
$ source fastestdb/bin/activate
$ git clone https://github.com/peterbe/fastestdb.git
$ cd fastestdb
$ pip install -r requirements.txt
$ python tornado_app.py

Then fire up http://localhost:8000/benchmark?how_many=10 and see if you can get it running.

Note: You might need to mess around with some of the hardcoded connection details in the file tornado_app.py.

Discussion

Before the lynch mob of HackerNews kill me for saying something positive about MongoDB; I'm perfectly aware of the discussions about large datasets and the complexities of managing them. Any flametroll comments about "web scale" will be deleted.

I think MongoDB does a really good job here. It's faster than Redis and Memcache but unlike those key-value stores, with MongoDB you can, if you need to, do actual queries (e.g. select all talks where the duration is greater than 0.5). MongoDB does its serialization between python and the database using a binary wrapper called BSON but mind you, the Redis and Memcache drivers also go to use a binary JSON encoding/decoder.

The conclusion is; be aware what you want to do with your data and what and where performance versus durability matters.

What's next

Some of those drivers will work on PyPy which I'm looking forward to testing. It should work with cffi like psycopg2cffi for example for PostgreSQL.

Also, an asynchronous version of elasticsearch should be interesting.