A blog and website by Peter Bengtsson

I have and have had many sites that I run. They're all some form of side-project.

What they almost all have in common is two things:

  1. They have very little traffic (thus not particularly mission critical)
  2. I run everything on one server (no need for "spinning up" new VMs here and there)

Many many years ago, when current interns I work with were mere babies, I started a very simple "procedure".

  1. On the server, in the user directory where the site is deployed, I write a script called something like which is executable and does what the name of the script is: it upgrades the site.

  2. In the server's root home directory I write a script called which also does exactly what the name of the script is: it restarts the service.

  3. On my laptop, in my ~/bin directory I create a script called (*) which runs on the server and runs also on the server.

And here is, if I may say so, the cleverness of this; I use ssh to execute these scripts remotely by simply piping the commands to ssh. For example:

echo "./" | ssh -A
echo "./" | ssh

That's an example I use for Wish List Granted.
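
If you'd rather drive the same trick from Python than from a shell script, a rough equivalent might look like the sketch below. It is only a sketch under assumptions: the host and script names in the usage note are made-up placeholders, and `runner` is a hypothetical hook that exists purely so the function can be tried out without a real server.

```python
import subprocess


def remote_run(host, command, forward_agent=False, runner=subprocess.run):
    """Pipe `command` to ssh on `host`, like: echo "<command>" | ssh [-A] <host>"""
    argv = ["ssh"]
    if forward_agent:
        # -A turns on ssh agent forwarding (presumably useful when the
        # remote script needs your local keys, e.g. for a git pull)
        argv.append("-A")
    argv.append(host)
    # The command is sent on stdin, which is what the shell pipe does
    return runner(argv, input=command + "\n", text=True, check=True)
```

Calling `remote_run("user@example.com", "./some-script.sh", forward_agent=True)` would then do roughly what the first of the two lines above does. Both names are placeholders.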

This works so darn well, and has done for years, that this is why I've never really learned to use more advanced tools like Fabric, Salt, Puppet, Chef or <insert latest deployment tool name>.

This means that all I need to do to run a deployment is just type[ENTER] and the simple little bash scripts take care of everything else.

The reason I keep these on the server and not on my laptop is simply because that's where they naturally belong and if I'm ssh'ed in and mess around I don't have to exit out to re-run them.

Here's an example of the I use for Wish List Granted:

cd generousfriends
source venv/bin/activate
git pull origin master
find . | grep '\.pyc$' | xargs rm -f
pip install -r requirements/prod.txt
./ syncdb --noinput
./ migrate webapp.main
./ collectstatic --noinput
./ compress --force
echo "Restart must be done by root"

I hope that, by blogging about this, someone else sees that it doesn't really have to be that complicated. It's not rocket science and most complex tools are only really needed when you have a significantly bigger scale in terms of people- and skill-complexity.

In conclusion

Keep it simple.

(*) The reason for the capitalization of my scripts is also an old habit. I use that habit to differentiate my scripts from stuff I install from any third parties.

On Wednesday this week, I managed to get a link to Wish List Granted onto Hacker News. It had enough upvotes to be featured on the front page for a couple of hours. I'm very grateful for the added traffic but not quite so impressed with the ultimate conversions.

  • 4,428 unique visitors
  • 43 Wish Lists created
  • 2 Usersnap pieces of constructive feedback
  • 0 payments made

Google Analytics
So that's 1% conversion of people setting up a wish list. But kinda disappointing that nobody ever made a payment. Actually, one friend did make a payment. But he's a colleague and a friend so not a stranger who stumbled onto it from Hacker News.

Also, it's now been 3 days since those 43 wish lists were created and still no payments. That's kinda disappointing too.

I'm starting to fear that Wish List Granted is one of those ideas that people think is great but have no interest in using.

I built something. It's called Wish List Granted.

It's a mash-up using's Wish List functionality. What you do is you hook up your Amazon wish list onto and pick one item. Then you share that page with friends and family and they can then contribute a small amount each. When the full amount is reached, Wish List Granted will purchase the item and send it to you.

The Rules page has more details if you're interested.

The problem it tries to solve is that you have friends who want something and even if it's a good friend you might be hesitant to spend $50 on a gift for them. I'm sure you can afford it but if you have many friends it gets impractical. However, spending $5 is another matter. Hopefully Wish List Granted solves that problem.

Wish List Granted started as one of those insomnia late-night projects. I first wrote a scraper using PyQuery, then a couple of Django models and views, and then tied it up by integrating Balanced Payments. It was actually working on the first night. Flawed but working start to finish.

When it all started, I used Persona to require people to authenticate to set up a Wish List. After some thought I decided to ditch that and use "email authentication" meaning they have to enter an email address and click a secure link I send to them.

One thing I'm very proud of about Wish List Granted is that it does NOT store any passwords, any credit cards or any personal shipping addresses. Despite being so totally void of personal data I thought it'd look nicer if the whole site is on HTTPS.

More information on the Help & Frequently Asked Questions page.

I looked around for Javascript libs that do automatic input formatting for credit card inputs.

The first one was formatter.js which looked promising but it weighs over 6Kb minified and also, when you apply it the placeholder attribute you have on the input disappears.

So, in true software engineering fashion I wrote my own:

function cc_format(value) {
  // strip whitespace and anything that isn't a digit
  var v = value.replace(/\s+/g, '').replace(/[^0-9]/gi, '')
  var matches = v.match(/\d{4,16}/g);
  var match = matches && matches[0] || ''
  var parts = []
  // chop the digits up into groups of four
  for (var i=0, len=match.length; i<len; i+=4) {
    parts.push(match.substring(i, i+4))
  }
  if (parts.length) {
    return parts.join(' ')
  } else {
    return value
  }
}

And some tests to prove it:

assert(cc_format('1234') === '1234')
assert(cc_format('123456') === '1234 56')
assert(cc_format('123456789') === '1234 5678 9')
assert(cc_format('') === '')
assert(cc_format('1234 1234 5') === '1234 1234 5')
assert(cc_format('1234 a 1234x 5') === '1234 1234 5')

Check out the Demo
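
If you ever need the same grouping on the server side, the logic ports over to Python fairly directly. The following is a hedged port of the function above, not something that is part of the demo. It keeps the same behaviour of returning the input untouched when there are fewer than four digits to work with.

```python
import re


def cc_format(value):
    """Group the first run of 4 to 16 digits into blocks of four."""
    # Drop everything that isn't a digit (spaces, letters, dashes, ...)
    digits = re.sub(r"[^0-9]", "", value)
    match = re.search(r"[0-9]{4,16}", digits)
    if not match:
        # Too few digits to format, hand back what we were given
        return value
    run = match.group(0)
    return " ".join(run[i:i + 4] for i in range(0, len(run), 4))
```

The same inputs as in the JavaScript tests should give the same results, for example `cc_format('1234 a 1234x 5')` gives `'1234 1234 5'`.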

This has served me well over the last couple of years of using Django:

from django import forms

class _BaseForm(object):
    def clean(self):
        cleaned_data = super(_BaseForm, self).clean()
        for field in cleaned_data:
            if isinstance(cleaned_data[field], basestring):
                cleaned_data[field] = (
                    cleaned_data[field].replace('\r\n', '\n')
                    .replace(u'\u2018', "'").replace(u'\u2019', "'").strip())

        return cleaned_data

class BaseModelForm(_BaseForm, forms.ModelForm):
    pass

class BaseForm(_BaseForm, forms.Form):
    pass

So instead of doing...

class SignupForm(forms.Form):
    name = forms.CharField(max_length=100)
    nick_name = forms.CharField(max_length=100, required=False)

...you do:

class SignupForm(BaseForm):
    name = forms.CharField(max_length=100)
    nick_name = forms.CharField(max_length=100, required=False)

What it does is make sure that any form field that takes a string has all preceding and trailing whitespace stripped. It also replaces the strange "curved" apostrophe ticks that Microsoft Windows sometimes uses.
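
To see the clean-up in isolation, here is a small sketch of the same string handling without Django. It is written for Python 3, so `str` stands in for `basestring`, and the function names are made up for this example.

```python
def clean_string(value):
    """Apply the same clean-up as the form mixin to a single value."""
    if not isinstance(value, str):
        # Non-string values (numbers, dates, None, ...) pass through untouched
        return value
    return (
        value.replace("\r\n", "\n")   # Windows line breaks become \n
        .replace("\u2018", "'")       # left curly apostrophe
        .replace("\u2019", "'")       # right curly apostrophe
        .strip()
    )


def clean_all(cleaned_data):
    """Clean every value in a dict that looks like a form's cleaned_data."""
    return {key: clean_string(value) for key, value in cleaned_data.items()}
```

For example, `clean_all({'name': '  Peter\u2019s  ', 'age': 3})` comes back as `{'name': "Peter's", 'age': 3}`.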

Yes, this might all seem trivial and I'm sure there's something as good or better out there but isn't it a nice thing to never have to worry about doing things like this again:

class SignupForm(forms.Form):

    def clean_name(self):
        return self.cleaned_data['name'].strip()


form = SignupForm(request.POST)
if form.is_valid():
    name = form.cleaned_data['name'].strip()


This breaks some fields, like DateField.

>>> class F(BaseForm):
...     start_date = forms.DateField()
...     def clean_start_date(self):
...         return self.cleaned_data['start_date']
>>> f=F({'start_date': '2013-01-01'})
>>> f.is_valid()
>>> f.cleaned_data['start_date']
datetime.datetime(2013, 1, 1, 0, 0)

As you can see, it cleans up '2013-01-01' into datetime.datetime(2013, 1, 1, 0, 0) when it should become datetime.date(2013, 1, 1).

Not sure why yet.

No, this is not about the new JSON Type added in Postgres 9.2. This is about how you can get a record set from a Postgres database into a JSON string the best way possible using Python.

Here's the traditional way:

>>> import json
>>> import psycopg2
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor()
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>> columns = (
...     'id', 'oid', 'root', 'approved', 'name'
... )
>>> results = []
>>> for row in cur.fetchall():
...     results.append(dict(zip(columns, row)))
>>> print json.dumps(results, indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  }
]

This is plain and nice but it's kinda annoying that you have to write down the columns you're selecting twice.
Also, it's annoying that you have to convert the results of fetchall() into a list of dicts in an extra loop.

So, there's a trick to the rescue! You can use the cursor_factory parameter. See below:

>>> import json
>>> import psycopg2
>>> from psycopg2.extras import RealDictCursor
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor(cursor_factory=RealDictCursor)
>>> cur.execute("""
...   SELECT
...     id, oid, root, approved, name
...   FROM blogcomments
...   LIMIT 10
... """)
>>> print json.dumps(cur.fetchall(), indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  }
]

Isn't that much nicer? It's shorter and only lists the columns once.

But is it much faster? Sadly, no it's not. Not much faster. I ran various benchmarks comparing various ways of doing this and basically concluded that there's no significant difference. The latter one using RealDictCursor is around 5% faster. But I suspect almost all the time in the benchmark is spent doing things (the I/O) that are no different between the various versions.

Anyway. It's a keeper. I think it just looks nicer.
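
If you're on a driver that has no dict cursor, a more generic way to avoid typing the columns twice is to read the names off `cursor.description`, which is part of the Python DB-API. The sketch below uses the standard library's sqlite3 as a stand-in for Postgres so that it runs anywhere. The table contents are made up, and I'm assuming the same helper would work on a plain psycopg2 cursor since it exposes `description` too.

```python
import json
import sqlite3


def rows_as_dicts(cursor):
    """Turn the rows of an already executed DB-API cursor into a list of dicts."""
    # Each entry in cursor.description starts with the column name
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def demo():
    # In-memory sqlite database standing in for the real Postgres one
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blogcomments (id INTEGER, oid TEXT, name TEXT)")
    conn.execute("INSERT INTO blogcomments VALUES (1, 'c1', 'Peter')")
    cur = conn.cursor()
    cur.execute("SELECT id, oid, name FROM blogcomments LIMIT 10")
    return json.dumps(rows_as_dicts(cur), indent=2)
```

It still loops over the rows in Python, so don't expect it to be any faster. It just keeps the column list in one place.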

Since moving over to my new EC2 server, I now have all my working sites on one server.

If I list all sites in /etc/nginx/sites-enabled/ I count 14 sites. This blog being one of many. More listed here.

All but one of these services are Python. One is a Node server. About half of the Python services are Django and the other half are Tornado. There are four persistent databases (Postgres, Redis, Memcache, MongoDB) and two message queues (RabbitMQ and Python RQ).

I have this little script called which does a decent job summarizing how much memory all of these take. Its output currently looks like this:

 Private  +   Shared  =  RAM used   Program
  6.5 MiB +  27.3 MiB =  33.7 MiB   postgres (5)
 40.1 MiB +  58.0 KiB =  40.1 MiB   memcached
 54.7 MiB +  37.5 KiB =  54.7 MiB   redis-server
 72.2 MiB + 849.0 KiB =  73.1 MiB   mongod
 82.4 MiB +   1.5 MiB =  83.9 MiB   rqworker (10)
605.6 MiB + 350.9 MiB = 956.5 MiB   python (61)
  1.9 GiB +  51.2 MiB =   2.0 GiB   uwsgi-core (26)
                          3.3 GiB                       

It's sorted by "RAM used" and I'm only showing the bottom 7 here.
Anyway, 3.3 GB to run 14 sites isn't bad. All through one Nginx (which only uses 10 MB by the way).

The server is a Debian 7 on a reserved Large instance. I'll try to post an update later about this server with more details. I have a lot of work to do to set up all monitoring and backups for all these things.

I've started experimenting with my home page to make it load even faster.

Amazon famously does this too, which you can read more about in this Steve Souders post. They make sure everything that needs to be visible above the fold is loaded first and then start loading all the other "stuff" below the fold. The assumption is that the user requests the page, watches it render and some time after it has rendered reaches for the mouse and starts scrolling down for more content. Or perhaps never bothers to scroll down at all. Either way, everything below the fold can wait. We have more time to load that in later.

What we want to avoid is a load graph like this:

big html document delays loading other stuff

The graph is deliberately zoomed out so that we don't get stuck on the details of that particular graph. But basically, you have a very heavy document to load which needs to be fully loaded (and partially rendered) before it can load all the other stuff that the page entails. As you can see, the first load (the HTML document) is taking up a majority of the load time. Once that's downloaded the browser can start parsing it and start rendering it. Simultaneously it can start downloading all the mentioned resources such as images, JavaScript and CSS.

On WebPagetest they call this Speed Index; "The Speed Index is the average time at which visible parts of the page are displayed."
So basically, you want to display as much as you possibly can and then load in other things that are necessary but can wait in the background.

So, how did I accomplish this on my site?

Basically, the home page uses a piece of Django code that picks up the 10 most recent blog posts and includes them in the template. Instead, I made it only pick up the first 2 and then after window.onload a piece of AJAX code loads the HTML for the remaining 8 blog posts.
That means that much less is required to load the home page. The page is smaller and references fewer images. The AJAX code is very crude and simple but works well enough:

onload = function() {
  microAjax("/rest/2/10/", function (res) {
    document.getElementById('rest').innerHTML = res;
  });
};

The user probably won't notice a huge difference if she avoids looking at the loading spinner of her browser. Only if she is really really fast at scrolling down will she notice that the rest of the page (about 80% of its vertical space) comes in a little bit later.
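
On the server side the split boils down to two slices of the same list of posts. Here is a minimal sketch of that idea in plain Python. The function names are made up, and reading the two numbers in the /rest/2/10/ URL as the start and end of a slice is my assumption for this example, not something lifted from the actual changeset.

```python
def home_page_posts(posts, above_fold=2):
    """The posts rendered straight into the home page HTML."""
    return posts[:above_fold]


def rest_posts(posts, start, end):
    """The posts fetched later by AJAX, e.g. start=2 and end=10."""
    return posts[start:end]
```

Between them, `home_page_posts(posts)` and `rest_posts(posts, 2, 10)` cover the same 10 posts the home page used to render in one go.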

So, did it work?

I hope so! The theory is sound. However, my home page is, unlike an Amazon product page, very sparse. The page weighs a total of 77Kb (excluding external resources) but now only the first 25Kb is loaded and the rest later.

Here's a measurement before and one after. It's kinda hard to compare because "fluctuations" on network I/O make measurements like this quite unpredictable. Also, there are various odd requests like New Relic and Google Analytics which cloud the waterfall view. However, what really matters is in the "First View" of the after measurement. If you look closely you'll see that now a bunch of images aren't loaded until after the "Document Complete" event has fired. That, to me, is a big win.

Below the fold

If you're interested in how it was done, check out this changeset.

When you use a web framework like Tornado, which is single threaded with an event loop (like Node.js, if you're familiar with that), and you need persistency (i.e. a database) there is one important question you need to ask yourself:

Is the query fast enough that I don't need to do it asynchronously?

If it's going to be a really fast query (for example, selecting a small recordset by key (which is indexed)) it'll be quicker to just do it in a blocking fashion. It means less CPU work to jump between the events.

However, if the query is going to be potentially slow (like a complex and data intensive report) it's better to execute the query asynchronously, do something else and continue once the database gets back with a result. If you don't, all other requests to your web server might time out.
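
To make the difference concrete, here is a small sketch of keeping a slow query from blocking the event loop. Note the assumptions: it uses the standard library's asyncio instead of Tornado, a `time.sleep` instead of a real query, and a thread pool instead of a truly asynchronous driver. All function names are made up.

```python
import asyncio
import time


def slow_report(seconds):
    """Pretend to be a slow, blocking database query."""
    time.sleep(seconds)
    return "report done"


async def handle_slow_request(seconds):
    # Hand the blocking call to a worker thread so the event loop stays free
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_report, seconds)


async def serve_two_slow_requests(seconds):
    started = time.monotonic()
    # Both "requests" are in flight at the same time
    results = await asyncio.gather(
        handle_slow_request(seconds),
        handle_slow_request(seconds),
    )
    return results, time.monotonic() - started


def demo(seconds=0.3):
    return asyncio.run(serve_two_slow_requests(seconds))
```

Had `slow_report` been called directly inside the handler, the second request would have had to wait for the first one to finish, and the total time would be roughly double.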

Another important question whenever you work with a database is:

Would it be a disaster if you intend to store something that ends up not getting stored on disk?

This question is related to the D in ACID and doesn't have anything specific to do with Tornado. However, the reason you're using Tornado is probably because it's much more performant than more convenient alternatives like Django. So, if performance is so important, are durable writes important too?

Let's cut to the chase... I wanted to see how different databases perform when integrating them in Tornado. But let's not just look at different databases, let's also evaluate different ways of using them; either blocking or non-blocking.

What the benchmark does is:

  • On one single Python process...
  • For each database engine...
  • Create X records of something containing a string, a datetime, a list and a floating point number...
  • Edit each of these records which will require a fetch and an update...
  • Delete each of these records...

I can vary the number of records ("X") and sum the total wall clock time it takes for each database engine to complete all of these tasks. That way you get an insert, a select, an update and a delete. Realistically, it's likely you'll get a lot more selects than any of the other operations.
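
The shape of that loop can be sketched as below. This is not the actual benchmark code. `DictEngine` is a hypothetical stand-in for a real driver (a plain dict holding JSON strings keyed by a uuid4), and the field names `topic` and `when` are invented for the example.

```python
import json
import time
import uuid
from datetime import datetime


class DictEngine:
    """Hypothetical stand-in for a database driver: JSON strings keyed by uuid4."""

    def __init__(self):
        self.store = {}

    def create(self, record):
        key = uuid.uuid4().hex
        self.store[key] = json.dumps(record)
        return key

    def edit(self, key, **changes):
        record = json.loads(self.store[key])  # the fetch
        record.update(changes)
        self.store[key] = json.dumps(record)  # the update

    def delete(self, key):
        del self.store[key]


def benchmark(engine, how_many):
    """Return the wall clock seconds it takes to create, edit and delete records."""
    started = time.perf_counter()
    keys = []
    for i in range(how_many):
        keys.append(engine.create({
            "topic": "Talk %d" % i,              # a string
            "when": datetime.now().isoformat(),  # a datetime, as text for JSON
            "tags": ["one", "two"],              # a list
            "duration": 1.5,                     # a floating point number
        }))
    for key in keys:
        engine.edit(key, duration=0.5)
    for key in keys:
        engine.delete(key)
    return time.perf_counter() - started
```

Swapping in a different engine object with the same three methods is all it would take to compare drivers with the same loop.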

And the winner is:

pymongo!! Using the blocking version without doing safe writes.

Fastest database for Tornado

Let me explain some of those engines

  • pymongo is the blocking pure python engine
  • with the redis, toredis and memcache a document ID is generated with uuid4, converted to JSON and stored as a key
  • toredis is a redis wrapper for Tornado
  • when it says (safe) on the engine it means to tell MongoDB to not respond until it has with some confidence written the data
  • motor is an asynchronous MongoDB driver specifically for Tornado
  • MySQL doesn't support arrays (unlike PostgreSQL) so instead the tags field is stored as text and transformed back and forth as JSON
  • None of these databases have been tuned for performance. They're all fresh out-of-the-box installs on OSX with homebrew
  • None of these databases have indexes apart from ElasticSearch where all things are indexes
  • momoko is an awesome wrapper for psycopg2 which works asynchronously specifically with Tornado
  • memcache is not persistent but I wanted to include it as a reference
  • All JSON encoding and decoding is done using ultrajson which should work to memcache, redis, toredis and mysql's advantage.
  • mongokit is a thin wrapper on pymongo that makes it feel more like an ORM
  • A lot of these can be optimized by doing bulk operations but I don't think that's fair
  • I don't yet have a way of measuring memory usage for each driver+engine but that's not really what this blog post is about
  • I'd love to do more work on running these benchmarks on concurrent hits to the server. However, with blocking drivers what would happen is that each request (other than the first one) would have to sit there and wait so the user experience would be poor but it wouldn't be any faster in total time.
  • I use the official elasticsearch driver but am curious to also add Tornado-es some day which will do asynchronous HTTP calls over to ES.

You can run the benchmark yourself

The code is here on github. The following steps should work:

$ virtualenv fastestdb
$ source fastestdb/bin/activate
$ git clone
$ cd fastestdb
$ pip install -r requirements.txt
$ python

Then fire up http://localhost:8000/benchmark?how_many=10 and see if you can get it running.

Note: You might need to mess around with some of the hardcoded connection details in the file


Before the lynch mob of HackerNews kill me for saying something positive about MongoDB; I'm perfectly aware of the discussions about large datasets and the complexities of managing them. Any flametroll comments about "web scale" will be deleted.

I think MongoDB does a really good job here. It's faster than Redis and Memcache but unlike those key-value stores, with MongoDB you can, if you need to, do actual queries (e.g. select all talks where the duration is greater than 0.5). MongoDB does its serialization between Python and the database using a binary format called BSON but mind you, the Redis and Memcache drivers also get to use a binary JSON encoder/decoder.

The conclusion is: be aware of what you want to do with your data and where performance versus durability matters.

What's next

Some of those drivers will work on PyPy which I'm looking forward to testing. It should work with cffi like psycopg2cffi for example for PostgreSQL.

Also, an asynchronous version of elasticsearch should be interesting.

I'm now off by about two months but in June 2003 I posted my first ever blog post.

My first website was launched in 1997 but that one is long lost. The next version, which actually used a database and a real web framework was launched in 2001 and this is the oldest screenshot I could find.

A really old version of my blog
Back then the site was built in Zope which at the time was the coolest shit you could possibly use. Back in 2003 I was renting a room in an apartment in London when I was studying at City University. The broadband (Americans know this as DSL) we had came with a static IP address so I could tie my domain name directly to my bedroom basically. If you were born in the nineties or later you wouldn't remember this but for almost 20 years you could either buy a laptop (small but slow) or a stationary computer (clunky but fast) and the laptop I was running on was no exception. Not to mention it was an abandoned laptop too. I think it had about 8 MB of RAM. I ran a stripped down version of Debian on it without any graphical interface. I managed the code by scp'ing files into it from my Windows computer.

Anyway, running on a home DSL line on a rusty old laptop blinking away under my bed meant that the site would be ultra-slow if I didn't pre-optimize it. And that was something I did. The site had a Squid cache in front of it and the HTML, CSS and JavaScript were compressed by a script I wrote called slimmer.

Back in 2003 blogging was getting hotter than celebrity spotting and I was very much interested in something that later came to be called "SEO". The rumor at the time was that "blogs" got penalized by Google because blogs usually just re-posted stuff from real web pages. So I decided to prefix all my content with the word "plog". It was a mix of "p" for Peter and being sufficiently different from the word "blog".

In the first couple of years of blogging I would blog about all sorts of stuff that caught my interest. Not just genuine thoughts or real technology notes but any fun link I came across. That became a massive trend later (and still is I guess) thanks to giants like Digg and Reddit so I stopped doing that with my own blog. In the last 7 years (give or take) I have only blogged about things that are genuinely close to my heart or something I've actually worked on.

Some stats:

Total number of blog posts: 949
Total number of approved blog comments: 8,086
Number of email addresses collected: 4,292
Maximum number of comments on any one post: 2,749
Number of Cease and Desist letters received: 1

To me, blogging used to be a form of shouting out to the world what I found interesting in the hope that you'll also find it interesting and that you'll thank me for finding that. Now it's a way for me of either documenting something I've learned recently or some other announcement that is related to what I do on some technical thing.
I wonder how this will change for me in the next 10 years.