A blog and website by Peter Bengtsson

Weight of your PostgreSQL tables "lumped together"

31 October 2015 0 comments   PostgreSQL

We have lots of tables that weigh a lot. Some of the tables are partitions so they're called "mytable_20150901" and "mytable_20151001" etc.

To find out how much each table weighs you can use this query:

select table_name, pg_relation_size(table_name), pg_size_pretty(pg_relation_size(table_name))
from information_schema.tables
where table_schema = 'public'
order by 2 desc limit 10;

It'll give you an output like this:

        table_name        | pg_relation_size | pg_size_pretty
 raw_adi_logs             |      14724538368 | 14 GB
 raw_adi                  |      14691426304 | 14 GB
 tcbs                     |       7173865472 | 6842 MB
 exploitability_reports   |       6512738304 | 6211 MB
 reports_duplicates       |       4428742656 | 4224 MB
 addresses                |       4120412160 | 3930 MB
 missing_symbols_20150601 |       3264897024 | 3114 MB
 missing_symbols_20150608 |       3170762752 | 3024 MB
 missing_symbols_20150622 |       3039731712 | 2899 MB
 missing_symbols_20150615 |       2967281664 | 2830 MB
(10 rows)

But as you can see in this example, it might be interesting to know what the sum is of all the missing_symbols_* partitions.

Without further ado, here's how you do that:

select table_name, total, pg_size_pretty(total)
from (
  select trim(trailing '_0123456789' from table_name) as table_name, 
  sum(pg_relation_size(table_name)) as total
  from information_schema.tables
  where table_schema = 'public'
  group by 1
) as agg
order by 2 desc limit 10;

Then you'll get possibly very different results:

        table_name        |    total     | pg_size_pretty
 reports_user_info        | 157111115776 | 146 GB
 reports_clean            | 106995695616 | 100 GB
 reports                  | 100983242752 | 94 GB
 missing_symbols          |  42231529472 | 39 GB
 raw_adi_logs             |  14724538368 | 14 GB
 raw_adi                  |  14691426304 | 14 GB
 extensions               |  12237242368 | 11 GB
 tcbs                     |   7173865472 | 6842 MB
 exploitability_reports   |   6512738304 | 6211 MB
 signature_summary_uptime |   6027468800 | 5748 MB
(10 rows)

You can read more about the trim() function here.

How to "onchange" in ReactJS

21 October 2015 9 comments   Javascript, ReactJS

Normally, in vanilla Javascript, the onchange event is triggered after you have typed something into a field and then "exited out of it", e.g. click outside the field so the cursor isn't blinking in it any more. This for example

document.querySelector('input').onchange = function(event) {
  document.querySelector('code').textContent =;

First of all, let's talk about what this is useful for. One great example is a sign-up form where you have to pick a username or type in an email address or something. Before the user gets around to pressing the final submit button you might want to alert them early that their chosen username is available or already taken. Or you might want to alert early that the typed in email address is not a valid one. If you execute that kind of validation on every key stroke, it's unlikely to be a pleasant UI.

Problem is, you can't do that in ReactJS. It doesn't work like that. The explanation is quite non-trivial:

*"<input type="text" value="Untitled"> renders an input initialized with the value, Untitled. When the user updates the input, the node's value property will change. However, node.getAttribute('value') will still return the value used at initialization time, Untitled.

Unlike HTML, React components must represent the state of the view at any point in time and not only at initialization time."*

Basically, you can't easily rely on the input field because the state needs to come from the React app's state, not from the browser's idea of what the value should be.

You might try this

var Input = React.createClass({
  getInitialState: function() {
    return {typed: ''};
  onChange: function(event) {
  render: function() {
    return <div>
        <input type="text" onChange={this.onChange.bind(this)}/>
        You typed: <code>{this.state.typed}</code>
React.render(<Input/>, document.querySelector('div'));

But what you notice is the the onChange handler is fired on every key stroke. Not just when the whole input field has changed.

So, what to do?

The trick is surprisingly simple. Use onBlur instead!

Same snippet but using onBlur instead

var Input = React.createClass({
  getInitialState: function() {
    return {typed: ''};
  onBlur: function(event) {
  render: function() {
    return <div>
        <input type="text" onBlur={this.onBlur.bind(this)}/>
        You typed: <code>{this.state.typed}</code>
React.render(<Input/>, document.querySelector('div'));

Now, your handler is triggered after the user has finished with the field.

And bash basics

16 October 2015 2 comments   Linux, MacOSX

It's one of those things; not hard to understand and certainly not an advanced trick but I sometimes see people miss out on this.

In bash there are sort of two ways of saying "Do this and then do that". You can either say "Do this and no matter what happens then do that" or you can say "Do this and if that worked also do that".


Suppose you have two command executables you want to run. They can succeed or fail.

$ echo "Do this and no matter what happens then do that"
$ ./command1 ; ./command2

If you run that, ./command2 will run even if ./command1 failed.
The other one is...

$ echo "Do this and if that worked also do that"
$ ./command1 && ./command2

You might recognize the && thing from JavaScript or Java or C or one of those. If you recognize it you might quickly also conclude that you can do this too:

$ echo "Do this and only if it failed do that"
$ ./command1 || ./command2

In this latter case only one of those (or none!) will succeed.

So when does this come in handy?

Here are some examples that I often use:

Meaning, I know my code is good to push, iff the tests pass

$ nosetests && git commit -a -m "some feature" && git push peterbe mybranch

Or if you might want to be alerted if something failed after the first command slowly takes its time to finish:

$ nosetests && say "Tests finished" || say "Work harder"

(say is an OSX specific command and not a built-in in bash)

The ; is useful when you don't care if the first command finished and this is more rare. For example:

$ rm static/ ; ./ collectstatic --noinput

Why bother?

Perhaps it goes without saying, the reason for doing all of these is generally when the first command takes a long time and you don't want to sit and wait till it's finished to run the second time. By "piping them together" like this, the second command will safely start as soon as possible whilst you go away and pay attention to something else.

mozjpeg installation and sample

10 October 2015 3 comments   Linux, Web development, Mozilla

I've written about mozjpeg before where I showed what it can do to a sample directory full of different kinds of JPEGs. But let's get more real. Let's actually install it and look at one thumbnail and one big photo.

To install, I used the pre-compiled binaries from this wonderful site. Like this:

# wget
# dpkg -i mozjpeg_3.1_amd64.deb
# ls -l /opt/mozjpeg/bin/cjpeg
-rwxr-xr-x 1 root root 50784 Sep  3 19:03 /opt/mozjpeg/bin/cjpeg

I don't know why the binary executable becomes called cjpeg but that's fine. Let's put it in $PATH so other users can execute it:

# cd /usr/local/bin
# ln -s /opt/mozjpeg/bin/cjpeg

Now, let's actually use it for something. First we need a realistic lossy thumbnail that we can optimize.

$ wget

This was one of the thumbnails from a previous post called Panasonic Lumix from 2008 or a iPhone 5S from 2014.

Let's optimize!

$ jpeg -outfile ebf08e64e80170dc009e97f6f9681ceb.moz.jpg -optimise ebf08e64e80170dc009e97f6f9681ceb.jpg
$ ls -l ebf08e64e80170dc009e97f6f9681ceb.*
-rw-rw-r-- 1 django django 11391 Sep 26 17:04 ebf08e64e80170dc009e97f6f9681ceb.jpg
-rw-r--r-- 1 django django  9414 Oct 10 01:40 ebf08e64e80170dc009e97f6f9681ceb.moz.jpg

Yay! It's 17.4% smaller. Saving 1.93Kb.

So what do they look like? See for yourself:

I have to zoom in (⌘-+) 3 times until I can see any difference. But remember, the saving isn't massive but the usecase here is a thumbnail.

So, let's do the same with a non-thumbnail. Some huge JPEG.

$ time cjpeg -outfile Lumix-2.moz.jpg -optimise Lumix-2.jpg
real    0m3.285s
user    0m3.122s
sys     0m0.080s
$ ls -l Lumix*
-rw-rw-r-- 1 django django 4880446 Sep 26 17:20 Lumix-2.jpg
-rw-rw-r-- 1 django django 1546978 Oct 10 02:02 Lumix-2.moz.jpg
$ ls -lh Lumix*
-rw-rw-r-- 1 django django 4.7M Sep 26 17:20 Lumix-2.jpg
-rw-rw-r-- 1 django django 1.5M Oct 10 02:02 Lumix-2.moz.jpg

In other words, from 4.7Mb to 1.5Mb. It's 68.3% the size of the original. And the visual difference?

Again, I have to zoom in 3 times to be able to tell any difference and even when I've done that it's hard to tell which is which.

In conclusion, let's go ahead and use mozjpeg to optimize thumbnails.

localStorage is not async, but it's FAST!

06 October 2015 7 comments   Web development, AngularJS, Javascript

A long time I go I wrote an angular app that was pleasantly straight forward. It loads all records from the server in one big fat AJAX GET. The data is large, ~550Kb as a string of JSON, but that's OK because it's a fat-client app and it's extremely unlikely to grow any multiples of this. Yes, it'll some day go up to 1Mb but even that is fine.

Once ALL records are loaded with AJAX from the server, you can filter the whole set and paginate etc. It feels really nice and snappy. However, the app is slightly smarter than that. It has two cool additional features...

  1. Every 10 seconds it does an AJAX query to ask "Have any records been modified since {{insert latest modify date of all known records}}?" and if there's stuff, it updates.

  2. All AJAX queries from the server are cached in the browser's local storage (note, I didn't write localStorage, "local storage" encompasses multiple techniques). The purpose of that is to that on the next full load of the app, we can at least display what we had last time whilst we wait for the server to return the latest and greatest via a slowish network request.

  3. Suppose we have brand new browser with no local storage, because the default sort order is always known, instead of doing a full AJAX get of all records, it does a small one first: "Give me the top 20 records ordered by modify date" and once that's in, it does the big full AJAX request for all records. Thus bringing data to the eyes faster.

All of these these optimization tricks are accompanied with a flash message at the top that says: <img src="spinner.gif"> Currently using cached data. Loading all remaining records from server....

When I built this I decided to use localForage which is a convenience wrapper over localStorage AND IndexedDB that does it all asynchronously and with proper promises. And to make it work in AngularJS I used angular-localForage so it would work with Angular's cycle updates without custom $scope.$apply() stuff. I thought the advantage of this is that it being async means that the main event can continue doing important rendering stuff whilst the browser saves things to "disk" in the background.

Also, I was once told that localStorage, which is inherently blocking, has the risk that calling it the first time in a while might cause the browser to have to take a major break to boot data from actual disk into the browsers allocated memory. Turns out, that is extremely unlikely to be a problem (more about this is a future blog post). The warming up of fetching from disk and storing into the browser's memory happens when you start the browser the very first time. Chrome might be slightly different but I'm confident that this is how things work in Firefox and it has for many many months.

What's very important to note is that, by default, localForage will use IndexedDB as the storage backend. It has the advantage that it's async to boot and it supports much large data blobs.

So I timed, how long does it take for localForage to SET and GET the ~500Kb JSON data? I did that like this, for example:

var t0 =;
.then(function(data) {
    var t1 =;
    console.log('GET took', t1 - t0, 'ms');

The results are as follows:

Operation Iterations Average time
SET 4 341.0ms
GET 4 184.0ms

In all fairness, it doesn't actually matter how long it takes to save because my app actually doesn't depend on waiting for that promise to resolve. But it's an interesting number nevertheless.

So, here's what I did. I decided to drop all of that fancy localForage stuff and go back to basics. All I really need is these two operations:

// set stuff
localStorage.setItem('mykey', JSON.stringify(data))
// get stuff
var data = JSON.parse(localStorage.getItem('mykey') || '{}')

So, after I've refactored my code and deleted (6.33Kb + 22.3Kb) of extra .js files and put some performance measurements in:

Operation Iterations Average time
SET 4 5.9ms
GET 4 3.3ms

Just WOW!
That is so much faster. Sure the write operation is now blocking, but it's only taking 6 milliseconds. And the reason it took IndexedDB less than half a second also probably means more hard work for it to sweat CPU over.

Sold? I am :)

django-pipeline + django-jinja

04 October 2015 2 comments   Django

Do you have django-jinja in your Django 1.8 project to help you with your Jinja2 integration, and you use django-pipeline for your static assets?
If so, you need to tie them together by passing pipeline.templatetags.ext.PipelineExtension "to your Jinja2 environment". But how? Here's how:

# in your

from django_jinja.builtins import DEFAULT_EXTENSIONS

        'BACKEND': 'django_jinja.backend.Jinja2',
        'APP_DIRS': True,
        'OPTIONS': {
            'match_extension': '.jinja',
            'context_processors': [
            'extensions': DEFAULT_EXTENSIONS + [

Now, in your template you simply use the {% stylesheet '...' %} or {% javascript '...' %} tags in your .jinja templates without the {% load pipeline %} stuff.

It took me a little while to figure that out so I hope it helps someone else googling around for a solution alike.

Using Lovefield as an AJAX proxy maybe

30 September 2015 1 comment   Web development, Javascript

Lovefield, by Arthur Hsu at Google, is a cool little Javascript browser abstraction on top of IndexedDB. IndexedDB is this amazingly powerful asynchronous framework for storing data in the browser tied to the domain you're visiting. Unlike it's much "simpler" sibling localStorage, with IndexedDB you can store individual keys in a schema, use indexes for faster retrieval, asynchronous querying and much larger memory capacity than general DOM storage (e.g. localStorage and sessionStorage).

What Lovefield brings is best described by watching this video. But to save you time let me try...:

Anyway, it sounds really cool and I'm looking forward to using it for something. But before I do I thought I'd try using it as a "AJAX proxy".

So what's an AJAX proxy, you ask. Well, it can mean many things but what I have in mind is a pattern where a web app's MVC is tied to a local storage and the local storage is tied to AJAX. That means two immediate benefits:

Another subtle benefit is a "corner case" of that offline capability; when you load up the app you can read from local storage much much faster than from the network meaning you can display user data on the screen sooner. (With the obvious caveat that it might be stale and that the data will change once the network read is completed later)

So, to take this idea for a spin (the use of a local storage to remember the loaded data last time first) I extended my AJAX or Not playground with a hybrid that uses React to render the data, but render the data from Lovefield (and from localStorage too). Remember, it's an experiment so it's a bit clunky and perhaps contrieved. The objective is to notice how soon after loading the page, that data because available for your eyes to consume.

Here is the playground test page

You have to load it at least once to fill your IndexedDB with some data from an AJAX request. Then, reload the page and it'll display what it has locally (in IndexedDB extracted with the Lovefield API). Then, after it's loaded, try refreshing the browser repeatedly. With or without a Wifi connection.

Basically, it works. However, perhaps I've chosen the worst possible test bed for playing with Lovefield. Because it is super slow. If you open the web console, you'll see it reports how long it takes to extract the data out of Lovefield. The code looks like this:

getPostsFromDB: function() {
  return schemaBuilder.connect().then(function(db) {
    var table = db.getSchema().table('post');
var t0 =;
.then(function(results) {
  var t1 =;
  console.log(results.length, 'records extracted');
  console.log((t1 - t0).toFixed(2) + 'ms to extract');

You can see the source here in full.

So out of curiousity, I forked this experiment. Kept almost all the React code but replaced the Lovefield stuff with good old JSON.parse(localStorage.getItem('posts') || '[]'). See code here.
This only takes 1-2 milliseconds. Lovefield repeatedly takes about 400-550 milliseconds on my Firefox version 43.

By the way, load up the localStorage fork and after a first load, try refreshing it over and over and notice how amazingly fast it is. Yay localStorage!

Panasonic Lumix from 2008 or a iPhone 5S from 2014

26 September 2015 5 comments   Photos

Rummaging through an old box I found an old digital camera I bought in 2008, the Panasonic Lumix DMC-LX3. It was hot stuff when it came out and I loved it. So much lighter and smaller than my previous Nikon DSLR behemoth.

But how does this 7 year old camera compare to my iPhone 5S?? Without any scientific rigor I went to the park and took one picture with each "camera" (the iPhone is not really a camera, it just has a (good) camera).

Note! The thumbnails shown below are heavily optimized for web use. You have to click to see the original.

Here are the pictures taken with the Lumix:

Lumix 1

Lumix 2

Lumix 3

Lumix 4

Lumix 5

And here are the pictures taken with the iPhone 5S:

iPhone 5S 1

iPhone 5S 2

iPhone 5S 3

iPhone 5S 4

iPhone 5S 5

To compare, the best thing you can do is to open one of each so to say in separate tabs, or download, and zoom in and stare it down.

The total pixel area across all 5 images is about the same. The iPhone 5S pictures are slightly smaller in terms of dimension. The Lumix pictures are all 3,648x2,736 pixels. The iPhone 5S pictures are 3,264x2,448 pixels.

The 5 Lumix pictures weigh 19.1Mb and the iPhone 5S pictures weigh 11.6Mb.


In conclusion

I don't know which is better. The Lumix weighs more and is bigger volume than the iPhone and it doesn't have a web browser, GPS or WiFi. So if the pictures are about the same, the iPhone wins.

What do you think? If we ignore the practical aspect of carrying the Lumix, which pictures do you prefer?

ElasticSearch, snowball analyzer and stop words

25 September 2015 0 comments   Python

Disclaimer: I'm an ElasticSearch noob. Go easy on me

I have an application that uses ElasticSearch's more_like_this query to find related content. It basically works like this:

>>> index(index, doc_type, {'id': 1, 'title': 'Your cool title is here'})
>>> index(index, doc_type, {'id': 2, 'title': 'About is a cool headline'})
>>> index(index, doc_type, {'id': 3, 'title': 'Titles are your big thing'})

Then you can pick one ID (1, 2 or 3) and find related ones.
We can tell by looking at these three silly examples, the 1 and 2 have the words "is" and "cool" in common. 1 and 3 have "title" (stemming taken into account) and "your" in common. However, is there much value in connected these documents on the words "is" and "your"? I think not. Those are stop words. E.g. words like "the", "this", "from", "she" etc. Basically words that are commonly used as "glue" between more unique and specific words.

Anyway, if you index something in ElasticSearch as a text field you get, by default, the "standard" analyzer to analyze the incoming stuff to be indexed. The standard analyzer just splits the words on whitespace. A more compelling analyzer is the Snowball analyzer (original here) which supports intelligent stemming (turning "wife" ~= "wives") and stop words.

The problem is that the snowball analyzer has a very different set of stop words. We did some digging and thought this was the list it bases its English stop words on. But this was wrong. Note that that list has words like "your" and "about" listed there.

The way to find out how your analyzer treats a string and turns it into token is to the the _analyze tool. For example:

curl -XGET 'localhost:9200/{myindexname}/_analyze?analyzer=snowball' -d 'about your special is a the word' | json_print
  "tokens": [
      "end_offset": 5,
      "token": "about",
      "type": "",
      "start_offset": 0,
      "position": 1
      "end_offset": 10,
      "token": "your",
      "type": "",
      "start_offset": 6,
      "position": 2
      "end_offset": 18,
      "token": "special",
      "type": "",
      "start_offset": 11,
      "position": 3
      "end_offset": 32,
      "token": "word",
      "type": "",
      "start_offset": 28,
      "position": 7

So what you can see is that it finds the tokens "about", "your", "special" and "word". But it stop word ignored "is", "a" and "the". Hmm... I'm not happy with that. I don't think "about" and "your" are particularly helpful words.

So, how do you define your own stop words and override the one in the Snowball analyzer? Well, let me show you.

In code, I use pyelasticsearch so the index creation is done in Python.

    "a able about across after all almost also am among an and "
    "any are as at be because been but by can cannot could dear "
    "did do does either else ever every for from get got had has "
    "have he her hers him his how however i if in into is it its "
    "just least let like likely may me might most must my "
    "neither no nor not of off often on only or other our own "
    "rather said say says she should since so some than that the "
    "their them then there these they this tis to too twas us "
    "wants was we were what when where which while who whom why "
    "will with would yet you your".split()

def create():
    es = get_connection()
    index = get_index()
    es.create_index(index, settings={
        'settings': {
            'analysis': {
                'analyzer': {
                    'extended_snowball_analyzer': {
                        'type': 'snowball',
                        'stopwords': STOPWORDS,
        'mappings': {
            doc_type: {
                'properties': {
                    'title': {
                        'type': 'string',
                        'analyzer': 'extended_snowball_analyzer',

With that in place, now delete your index and re-create it. Now you can use the _analyze tool again to see how it analyzes text on this particular field. But note, to do this we need to know the name of the index we used. (so replace {myindexname} in the URL):

$ curl -XGET 'localhost:9200/{myindexname}/_analyze?field=title' -d 'about your special is a the word' | json_print
  "tokens": [
      "end_offset": 18,
      "token": "special",
      "type": "",
      "start_offset": 11,
      "position": 3
      "end_offset": 32,
      "token": "word",
      "type": "",
      "start_offset": 28,
      "position": 7

Cool! Now we see that it considers "about" and "your" as stop words. Much better. This is handy too because you might have certain words that are globally not very common but within your application it's very repeated and not very useful.

Thank you willkg and Erik Rose for your support in tracking this down!


14 September 2015 2 comments   Python, Django

I'm working on a (side)project in Django that uses the awesome Semantic UI CSS framework. This project has some Django forms that are rendered on the server and so I can't let Django render the form HTML or else the CSS framework can't do its magic.

The project is called django-semanticui-form and it's a fork from django-bootstrap-form.

It doesn't come with the Semantic UI CSS files at all. That's up to you. Semantic UI is available as a big fat bundle (i.e. one big .css file) but generally you just pick the components you want/need. To use it in your Django templates simply, create a django.forms.Form instance and render it like this:

{% load semanticui %}

  {{ myform | semanticui }}

The project is very quickly put together. The elements I intend to render seem to work but you might find that certain input elements don't work as nicely. However, if you want to help on the project, it's really easy to write tests and run tests. And Travis and automatic PyPI deployment is all set up so pull requests should be easy.