Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page! Currently only showing blog entries under the category: Tornado. Clear filter

When you use a web framework like Tornado, which is single threaded with an event loop (like nodejs familiar with that), and you need persistency (ie. a database) there is one important questions you need to ask yourself:

Is the query fast enough that I don't need to do it asynchronously?

If it's going to be a really fast query (for example, selecting a small recordset by key (which is indexed)) it'll be quicker to just do it in a blocking fashion. It means less CPU work to jump between the events.

However, if the query is going to be potentially slow (like a complex and data intensive report) it's better to execute the query asynchronously, do something else and continue once the database gets back a result. If you don't all other requests to your web server might time out.

Another important question whenever you work with a database is:

Would it be a disaster if you intend to store something that ends up not getting stored on disk?

This question is related to the D in ACID and doesn't have anything specific to do with Tornado. However, the reason you're using Tornado is probably because it's much more performant that more convenient alternatives like Django. So, if performance is so important, is durable writes important too?

Let's cut to the chase... I wanted to see how different databases perform when integrating them in Tornado. But let's not just look at different databases, let's also evaluate different ways of using them; either blocking or non-blocking.

What the benchmark does is:

  • On one single Python process...
  • For each database engine...
  • Create X records of something containing a string, a datetime, a list and a floating point number...
  • Edit each of these records which will require a fetch and an update...
  • Delete each of these records...

I can vary the number of records ("X") and sum the total wall clock time it takes for each database engine to complete all of these tasks. That way you get an insert, a select, an update and a delete. Realistically, it's likely you'll get a lot more selects than any of the other operations.

And the winner is:

pymongo!! Using the blocking version without doing safe writes.

Fastest database for Tornado

Let me explain some of those engines

  • pymongo is the blocking pure python engine
  • with the redis, toredis and memcache a document ID is generated with uuid4, converted to JSON and stored as a key
  • toredis is a redis wrapper for Tornado
  • when it says (safe) on the engine it means to tell MongoDB to not respond until it has with some confidence written the data
  • motor is an asynchronous MongoDB driver specifically for Tornado
  • MySQL doesn't support arrays (unlike PostgreSQL) so instead the tags field is stored as text and transformed back and fro as JSON
  • None of these database have been tuned for performance. They're all fresh out-of-the-box installs on OSX with homebrew
  • None of these database have indexes apart from ElasticSearch where all things are indexes
  • momoko is an awesome wrapper for psycopg2 which works asyncronously specifically with Tornado
  • memcache is not persistant but I wanted to include it as a reference
  • All JSON encoding and decoding is done using ultrajson which should work to memcache, redis, toredis and mysql's advantage.
  • mongokit is a thin wrapper on pymongo that makes it feel more like an ORM
  • A lot of these can be optimized by doing bulk operations but I don't think that's fair
  • I don't yet have a way of measuring memory usage for each driver+engine but that's not really what this blog post is about
  • I'd love to do more work on running these benchmarks on concurrent hits to the server. However, with blocking drivers what would happen is that each request (other than the first one) would have to sit there and wait so the user experience would be poor but it wouldn't be any faster in total time.
  • I use the official elasticsearch driver but am curious to also add Tornado-es some day which will do asynchronous HTTP calls over to ES.

You can run the benchmark yourself

The code is here on github. The following steps should work:

$ virtualenv fastestdb
$ source fastestdb/bin/activate
$ git clone https://github.com/peterbe/fastestdb.git
$ cd fastestdb
$ pip install -r requirements.txt
$ python tornado_app.py

Then fire up http://localhost:8000/benchmark?how_many=10 and see if you can get it running.

Note: You might need to mess around with some of the hardcoded connection details in the file tornado_app.py.

Discussion

Before the lynch mob of HackerNews kill me for saying something positive about MongoDB; I'm perfectly aware of the discussions about large datasets and the complexities of managing them. Any flametroll comments about "web scale" will be deleted.

I think MongoDB does a really good job here. It's faster than Redis and Memcache but unlike those key-value stores, with MongoDB you can, if you need to, do actual queries (e.g. select all talks where the duration is greater than 0.5). MongoDB does its serialization between python and the database using a binary wrapper called BSON but mind you, the Redis and Memcache drivers also go to use a binary JSON encoding/decoder.

The conclusion is; be aware what you want to do with your data and what and where performance versus durability matters.

What's next

Some of those drivers will work on PyPy which I'm looking forward to testing. It should work with cffi like psycopg2cffi for example for PostgreSQL.

Also, an asynchronous version of elasticsearch should be interesting.

So, the cool thing about Tornado the Python web framework is that it's based on a single thread IO loop. Aka Eventloop. This means that you can handle high concurrency with optimal performance. However, it means that can't do things that take a long time because then you're blocking all other users.

The solution to the blocking problem is to then switch to asynchronous callbacks which means a task can churn away in the background whilst your web server can crack on with other requests. That's great but it's actually not that easy. Writing callback code in Tornado is much more pleasant than say, Node, where you actually have to "fork" off in different functions with different scope. For example, here's what it might look like:

class MyHandler(tornado.web.RequestHandler):
    @asynchronous
    @gen.engine
    def get(self):
        http_client = AsyncHTTPClient()
        response = yield gen.Task(http_client.fetch, "http://example.com")
        stuff = do_something_with_response(response)
        self.render("template.html", **stuff)

It's pretty neat but it's still work. And sometimes you just don't know if something is going to be slow or not. If it's not going to be slow (e.g. fetching a simple value from a fast local database) you don't want to do it async anyway.

So on Around The World I have a whole Admin dashboard where I edit questions, upload photos and run various statistical reports which can be all very slow. Since the only user who is using the Admin is me, I don't care if it's not very fast. So, I don't have to worry about wrapping things like thumbnail pre-generation in asynchronous callback code. But I don't either want to block the rest of the app where every single request has to be fast. Here's how I solve that.

First, I start 4 different processors across 4 different ports:

127.0.0.1:10001
127.0.0.1:10002
127.0.0.1:10003
127.0.0.1:10004

Then, I decide that 127.0.0.1:10004 will be dedicated to slow blocking ops only used by the Admin dashboard.

In Nginx it's easy. Here's the config I used (simplified for clarity)

upstream aroundtheworld_backends {
    server 127.0.0.1:10001;
    server 127.0.0.1:10002;
    server 127.0.0.1:10003;
}

upstream aroundtheworld_admin_backends {
    server 127.0.0.1:10004;
}

server {
    server_name aroundtheworldgame.com;
    root /var/lib/tornado/aroundtheworld;
    ...

    try_files /maintenance.html @proxy;

    location /admin {
        proxy_pass http://aroundtheworld_admin_backends;
    }

    location @proxy {
        proxy_pass http://aroundtheworld_backends;
    } 

    ...
}

With this in play it means that I can write slow blocking code without worry about blocking other users than myself.

This might sound lazy and that I should use an asynchronous library for all my DB, net and file system access but mind you that's not without its own risks and trouble. Most of the DB access is built to be very very simple queries that are always really light and almost always done over and database index that is fully in RAM. Doing it like this I can code away without much complexity and yet never have to worry about making the site slow.

UPDATE

I changed the Nginx config to use the try_files directive instead. Looks nicer.

In yesterdays DjangoCon BDFL Keynote Adrian Holovaty called out that Django needs a Real-Time story. Well, here's a response to that: django-sockjs-tornado

Immediately after the keynote I went and found a comfortable chair and wrote this app. It's basically a django app that allows you to run a socketserver with manage.py like this:

python manage.py socketserver

Chat Demo screenshot
Now, you can use all of SockJS to write some really flashy socket apps. In Django! Using Django models and stuff. The example included shows how to write a really simple chat application using Django models. check out the whole demo here

If you're curious about SockJS read the README and here's one of many good threads about the difference between SockJS and socket.io.

The reason I could write this app so quickly was because I have already written a production app using sockjs-tornado so the concepts were familiar. However, this app has (at the time of writing) not been used in any production. So mind you it might still need some more love before you show your mom your django app with WebSockets.

I just recently landed some patches on toocool that implements and interesting pattern that is seen more and more these days. I call it: Persistent caching with fire-and-forget updates

Basically, the implementation is this: You issue a request that requires information about a Twitter user: E.g. http://toocoolfor.me/following/chucknorris/vs/peterbe The app looks into its MongoDB for information about the tweeter and if it can't find this user it goes onto the Twitter REST API and looks it up and saves the result in MongoDB. The next time the same information is requested, and the data is available in the MongoDB it instead checks if the modify_date or more than an hour and if so, it sends a job to the message queue (Celery with Redis in my case) to perform an update on this tweeter.

You can basically see the code here but just to reiterate and abbreviate, it looks like this:

tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter:
   result = yield tornado.gen.Task(...)
   if result:
       tweeter = self.save_tweeter_user(result)
   else:
       # deal with the error!
elif age(tweeter['modify_date']) > 3600:
   tasks.refresh_user_info.delay(username, ...)
# render the template!

What the client gets, i.e. the user using the site, is it that apart from the very first time that URL is request is instant results but data is being maintained and refreshed.

This pattern works great for data that doesn't have to be up-to-date to the second but that still needs a way to cache invalidate and re-fetch. This works because my limit of 1 hour is quite arbitrary. An alternative implementation would be something like this:

tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter or (tweeter and age(tweeter) > 3600 * 24 * 7):
    # re-fetch from Twitter REST API
elif age(tweeter) > 3600:
    # fire-and-forget update

That way you don't suffer from persistently cached data that is too old.

Integrate BrowserID in a Tornado web app BrowserID is a new single sign-on initiative lead by Mozilla that takes a very refreshing approach to single sign-on. It's basically like OpenID except better and similar to the OAuth solutions from Google, Twitter, Facebook, etc but without being tied to those closed third-parties.

At the moment, BrowserID is ready for production (I have it on Kwissle) but the getting started docs is still something that is under active development (I'm actually contributing to this).

Anyway, I thought I'd share how to integrate it with Tornado

First, you need to do the client-side of things. I use jQuery but that's not a requirement to be able to use BrowserID. Also, there are different "patterns" to do login. Either you have a header that either says "Sign in"/"Hi Your Username". Or you can have a dedicated page (e.g. mysite.com/login/). Let's, for simplicity sake, pretend we build a dedicated page to log in. First, add the necessary HTML:

<a href="#" id="browserid" title="Sign-in with BrowserID">
  <img src="/images/sign_in_blue.png" alt="Sign in">
</a>
<script src="https://browserid.org/include.js" async></script>

Next you need the Javascript in place so that clicking on the link above will open the BrowserID pop-up:

function loggedIn(response) {
  location.href = response.next_url;
  /* alternatively you could do something like this instead:
       $('#header .loggedin').show().text('Hi ' + response.first_name);
    ...or something like that */
}

function gotVerifiedEmail(assertion) {
 // got an assertion, now send it up to the server for verification
 if (assertion !== null) {
   $.ajax({
     type: 'POST',
     url: '/auth/login/browserid/',
     data: { assertion: assertion },
     success: function(res, status, xhr) {
       if (res === null) {}//loggedOut();
       else loggedIn(res);
     },
     error: function(res, status, xhr) {
       alert("login failure" + res);
     }
   });
 }
 else {
   //loggedOut();
 }
}

$(function() {
  $('#browserid').click(function() {
    navigator.id.getVerifiedEmail(gotVerifiedEmail);
    return false;
  });
});

Next up is the server-side part of BrowserID. Your job is to take the assertion that is given to you by the AJAX POST and trade that with https://browserid.org for an email address:

import urllib
import tornado.web
import tornado.escape 
import tornado.httpclient
...

@route('/auth/login/browserid/')  # ...or whatever you use
class BrowserIDAuthLoginHandler(tornado.web.RequestHandler):

   def check_xsrf_cookie(self):  # more about this later
       pass

   @tornado.web.asynchronous
   def post(self):
       assertion = self.get_argument('assertion')
       http_client = tornado.httpclient.AsyncHTTPClient()
       domain = 'my.domain.com'  # MAKE SURE YOU CHANGE THIS
       url = 'https://browserid.org/verify'
       data = {
         'assertion': assertion,
         'audience': domain,
       }
       response = http_client.fetch(
         url,
         method='POST',
         body=urllib.urlencode(data),
         callback=self.async_callback(self._on_response)
       )

   def _on_response(self, response):
       struct = tornado.escape.json_decode(response.body)
       if struct['status'] != 'okay':
           raise tornado.web.HTTPError(400, "Failed assertion test")
       email = struct['email']
       self.set_secure_cookie('user', email,
                                  expires_days=1)
       self.set_header("Content-Type", "application/json; charset=UTF-8")
       response = {'next_url': '/'}
       self.write(tornado.escape.json_encode(response))
       self.finish()

Now that should get you up and running. There's of couse a tonne of things that can be improved. Number one thing to improve is to use XSRF on the AJAX POST. The simplest way to do that would be to somehow dump the XSRF token generated into your page and include it in the AJAX POST. Perhaps something like this:

<script>
var _xsrf = '{{ xsrf_token }}';
...
function gotVerifiedEmail(assertion) {
 // got an assertion, now send it up to the server for verification
 if (assertion !== null) {  
    $.ajax({
     type: 'POST',
     url: '/auth/login/browserid/',
     data: { assertion: assertion, _xsrf: _xsrf },
... 
</script>

Another thing that could obviously do with a re-write is the way users are handled server-side. In the example above I just set the asserted user's email address in a secure cookie. More realistically, you'll have a database of users who you match by email address but instead store their database ID in a cookie or something like that.

What's so neat about solutions such as OpenID, BrowserID, etc. is that you can combine two things in one process: Sign-in and Registration. In your app, all you need to do is a simple if statement in the code like this:

user = self.db.User.find_by_email(email) 
if not user:
    user = self.db.User()
    user.email = email
    user.save()
self.set_secure_cookie('user', str(user.id))

Hopefully that'll encourage a couple of more Tornadonauts to give BrowserID a try.

Too Cool For Me? Too Cool For Me? is a fun little side-project I've been working on. It's all about and only for Twitter. You login, then install a bookmarklet then when browsing twitter you can see who follows you and who is too cool for you.

For me it's a chance to try some new tech and at the same time scratch an itch I had. The results can be quite funny but also sad too when you realise that someone uncool isn't following you even though you follow him/her.

The code is open source and available on Github and at least it might help people see how to do a web app in Tornado using MongoDB and asynchronous requests to the Twitter API

This is Part 3 in a series of blogs about various bits and pieces in the tornado-utils package. Part 1 is here and part 2 is here

send_mail

Code is here

First of all, I should say: I didn't write much of this code. It's copied from Django and modified to work in any Tornado app. It hinges on the same idea as Django that you have to specify what backend you want to use. A backend is an importable string pointing to a class that has the send_messages method.

To begin, here's a sample use case inside a request handler:

class ContactUs(tornado.web.RequestHandler):

   def post(self):
       msg = self.get_argument('msg')
       # NB: you might want to set this once and for all in something 
       # like self.application.settings['email_backend']
       backend = 'tornado_utils.send_mail.backends.smtp.EmailBackend'
       send_email(backend,
                  "New contact form entry",
                  msg + '\n\n--\nFrom our contact form\n',
                  'noreply@example.com',
                  ['webmaster@example.com'],
                  )
       self.write("Thanks!")

The problem is that SMTP is slow. Even though, in human terms, it's fast, it's still too slow for a non-blocking server that Tornado is. Taking 1-2 seconds to send a message over SMTP means it's blocking every other request to Tornado for 1-2 seconds. The solution is instead save the message on disk in pickled form and use a cron job to pick up the messages and send them by SMTP instead, outside the Tornado process. First do this re-write:

   ... 
   def post(self):
       msg = self.get_argument('msg')
-       backend = 'tornado_utils.send_mail.backends.smtp.EmailBackend'
+       backend = 'tornado_utils.send_mail.backends.pickle.EmailBackend'
   ...

Now, write a cron job script that looks something like this:

# send_pickled_messages.py
DRY_RUN = False

def main():
   from tornado_utils.send_mail import config
   filenames = glob(os.path.join(config.PICKLE_LOCATION, '*.pickle'))
   filenames.sort()
   if not filenames:
       return

   from tornado_utils.send_mail import backends
   import cPickle

   if DRY_RUN:
       EmailBackend = backends.console.EmailBackend
   else:
       EmailBackend = backends.smtp.EmailBackend
   max_count = 10
   filenames = filenames[:max_count]
   messages = [cPickle.load(open(x, 'rb')) for x in filenames]
   backend = EmailBackend()
   backend.send_messages(messages)
   if not DRY_RUN:
       for filename in filenames:
           os.remove(filename)

That code just above is butchered from a more comprehensive script I have but you get the idea. Writing to a pickle file is so fast it's in the lower milliseconds region. However, it depends on disk IO so if you need more speed, write a simple backend that writes instead of saving pickles on disk, make it write to a fast in-memory database like Redis or Memcache.

The code isn't new and it's been battle tested but it's only really been battle tested in the way that my apps use it. So you might stumble across bugs if you use it in a way I haven't tested. However, the code is Open Source and happily available for you to help out and improve.

This is Part 2 in a series of blogs about various bits and pieces in the tornado-utils package. Part 1 is here

tornado_static

Code is here

This code takes care of two things: 1) optimizing your static resources and 2) bundling and serving them with unique names so you can cache aggressively.

The trick is to make your development environment such that there's no need to do anything when in "debug mode" but when in "production mode" it needs to be perfect. Which files (e.g. jquery.js or style.css) you use and bundle is up to you and it's something you control from the templates in your Tornado app. Not a config setting because, almost always, which resources (aka. assets) you need is known and relevant only to the templates where you're working.

Using UI modules in Tornado requires a bit of Tornado-fu but here's one example and here is another (untested) example:

# app.py

import tornado.web
from tornado_utils.tornado_static import (
  StaticURL, Static, PlainStaticURL, PlainStatic) 

class Application(tornado.web.Application):
   def __init__(self):
       ui_modules = {}
       if options.debug:
           ui_modules['Static'] = PlainStatic
           ui_modules['StaticURL'] = PlainStaticURL
       else:
           ui_modules['Static'] = Static
           ui_modules['StaticURL'] = StaticURL

       app_settings = dict(
           template_path="templates",
           static_path="static",
           ui_modules=ui_modules,
           debug=options.debug,
           UGLIFYJS_LOCATION='~/bin/uglifyjs',
           CLOSURE_LOCATION="static/compiler.jar",
           YUI_LOCATION="static/yuicompressor-2.4.2.jar",
           cdn_prefix="cdn.mycloud.com",
       )

       handlers = [
         (r"/", HomeHandler),
         (r"/entry/([0-9]+)", EntryHandler),
       ] 
       super(Application, self).__init__(handlers, **app_settings)

def main(): # pragma: no cover
   tornado.options.parse_command_line()
   http_server = tornado.httpserver.HTTPServer(Application())
   http_server.listen(options.port)
   tornado.ioloop.IOLoop.instance().start()

if __name__ == "__main__":
   main()

Note! If you're looking to optimize your static resources in a Tornado app you probably already have a "framework" for setting up UI modules into your app. The above code is just to wet your appetite and to show how easy it is to set up. The real magic starts to happen in the template code. Here's a sample implementation:

<!doctype html>
<html>
   <head>
       <title>{% block title %}{{ page_title }}{% end %}</title>
       <meta charset="utf-8">
       {% module Static("css/ext/jquery.gritter.css", "css/style.css") %}
   </head>
   <body>
      <header>...</header>
   {% block content %}
   {% end %}
   {% module Static("js/ext/head.load.min.js") %}
   <script>
   var POST_LOAD = '{% module StaticURL("js/dojo.js", "js/dojo.async.js") %}';
   </script>
   </body>
</html>

What you get when run is a template that looks like this:

<!doctype html>
<html>
   <head>
       <title>My title</title>
       <meta charset="utf-8">
<link rel="stylesheet" type="text/css" href="//cdn.mycloud.com/combined/jquery.gritter.css.style.1313206609.css">
   </head>
   ...

(Note that this will create a whitespace optimized filed called "jquery.gritter.css.style.1313206609.css" in "/tmp/combined")

Have a play and see if it makes sense in your app. I do believe this can do with some Open Source love but so far it works great for me on Kwissle, DoneCal and TooCoolFor.Me

This is Part 1 in a series of blogs about various bits and pieces in the tornado-utils package.

tornado-utils is the result of me working on numerous web apps in Tornado that after too much copy-and-paste from project to project eventually became a repo standing on its own two legs.

TestClient

Code is here

This makes it possible to write integration tests that executes GET and POST requests on your app similarly to how Django does it. Example implementation:

# tests/base.py

from tornado_utils.http_test_client import TestClient, HTTPClientMixin
from tornado.testing import LogTrapTestCase, AsyncHTTPTestCase

class BaseAsyncTestCase(AsyncHTTPTestCase, LogTrapTestCase):
   pass 

class BaseHTTPTestCase(BaseAsyncTestCase, HTTPClientMixin):

   def setUp(self):
       super(BaseHTTPTestCase, self).setUp()
       self.client = TestClient(self)

# tests/test_handlers.py

from .base import BaseHTTPTestCase

class HandlersTestCase(BaseHTTPTestCase):

   def setUp(self):
       super(HandlersTestCase, self).setUp()
       self._create_user('peterbe', 'secret')  # use your imagination

   def test_homepage(self):
       response = self.client.get('/')
       self.assertEqual(response.code, 200)
       self.assertTrue('stranger' in response.body)

       data = {'username': 'peterbe', 'password': 'secret'}
       response = self.client.post('/login/', data)
       self.assertEqual(response.code, 302)

       response = self.client.get('/')
       self.assertEqual(response.code, 200)
       self.assertTrue('stranger' not in response.body)
       self.assertTrue('hi peterbe' in response.body)

You can see a sample implementation of this here

Note that this was one of the first pieces of test code I wrote in my very first Tornado app and it's not unlikely that some assumptions and approaches are naive or even wrong but what we have here works and it makes the test code very easy to read. All it basically does is wraps the http_client.fetch(...) call and also maintains a bucket of cookies

I hope it can be useful to someone new to writing tests in Tornado.

Launching Kwissle.com For the past couple of months I've been working on a little fun side-project that I've decided to call: Kwissle

It's an online game where you're paired up randomly with someone else who's online on the site at the same time and you enter a battle of general knowledge questions.

The rules are simple:

  • each battle has 10 questions
  • you get 15 seconds per question
  • if you type in the right answer (spelling mistakes are often allowed) you get 3 points
  • if you load the alternatives and get it right you get 1 point
  • you have to be fast or else your opponent might steal the question
  • whoever has the most points win!

The app has been quite a technical challenge because it's real-time. Like a chat. To accomplish this, I've decided to use the socket.io library with Tornado which means that if your browser supports it, it's using WebSockets or else if falls back on Flash sockets.

Because there is no easy way to source questions, I've instead opted to do it by crowd sourcing. What that simply means is that YOU help out adding the questions. You can also help out by reviewing other peoples questions before I eventually publish them.

Bear with me though; this is an early alpha release and it just about works. There are a tonne of features and improvements I want to add to make it more fun and solid but, like I said, this is a side-project so the only time I get to work on it is late evenings and weekends. The source code is not open but I'm going to try to be entirely open with what's going on on a tumblr blog

So, without further ado; head over and start playing