Fastest "boolean SQL queries" possible with Django

14 January 2011   5 comments   Django

Powered by Fusion×

For those familiar with the Django ORM they know how easy it is to work with and that you can do lots of nifty things with the result (QuerySet in Django lingo).

So I was working report that basically just needed to figure out if a particular product has been invoiced. Not for how much or when, just if it's included in an invoice or not.

The code was initially this:

def is_invoiced(self):
   for __ in Invoice.objects.filter(products=self):
      return True
    return False # prefer 'False' to 'None' in this case

Since the Invoice model has an automatic ordering on its date field I thought, that doing a loop would put that ordering into play each time which would suck. A quick way around that is to aggregate instead. First rewrite:

def is_invoiced(self):
   return bool(Invoice.objects.filter(products=self).count())

If you run the "EXPLAIN ANALYZE ..." on that SQL in PostgreSQL you'll notice that the aggregate causes one extra operation before it goes into doing any other filtering. Hmm... Perhaps I can optimize it even further by just doing a simple select but without the ordering and to optimize it even further I make it return a specific field only since this table has many fields and doing SELECT * FROM would be a waste of time. All I need is anything, id will do:

def is_invoiced(self):
    qs = Invoice.objects.filter(products=self)
    return bool(qs.only('id'))

Right, here's the kicker! I put all of these different "patterns" into simple files A.sql, B.sql, etc. Then, I ran these against a database where the relevant table contains about 1,500 rows like this:

$ time for i in $(seq 1 1000); do psql mydatabase < X.sql > /dev/null ; done

This doesn't test the speed of the database or the SQL statement because the overhead is spent on the reading from stdin and opening the database. So, for each file I copied the statement over 200 lines.

Also, to more simulate my real application, instead of filtering on an indexed primary key I made it into a simple operator on an integer number which yields about 1% of the whole table.

After having run them a bunch of times each and measured and noted their times I can conclude the following result:

* Doing an aggregate on COUNT(*):  100%
* Doing a full select without ordering: 200%
* Doing a full select with ordering: 200%
* Doing a select only on the id without ordering: 100%
* Doing a aggregate on COUNT('id'): 100% 

What this means is that doing an aggregate takes as long on * as it does on a primary key. Doing a minimal select without any ordering is as fast as doing the aggregate.

PostgreSQL is a smart cookie! This isn't news to me but it sure proves itself again. It knows to do the right things first and it knows what's needed in the end and runs the query in a very intelligent way.

Going back to the Django aspect of this. Surely, if you run this many many times there's an overhead when the ORM first applies itself to what's received back from the SQL and the __bool__ operator of the QuerySet completely discards the columns to fields conversion. Just having to get a simple integer back from the database would surely be faster. I would love to have time to dig further into that too one day.


What about calling qs.exists()?
Peter Bengtsson
I didn't know about .exists() aaarrh!!
Got to immediately check it out.
Matthew Scott
I second the "aaarrh!!"

There's a find-and-replace session in my own near future as well. :)
Luke Plant
Out of interest, how does these methods compare with QuerySet.exists()? That method is supposed to be fast, but I'm guessing from the Postgres side it might do effectively the same amount of work.
Peter Bengtsson
First of all, I didn't know about .exists(). That's great! I think syntactically it looks better than converting a count integer to a boolean.

In terms of SQL query complexity it's got the same complexity as a count. I'd need a larger set and a more realistic query to find out which is most cost effective (or rather, least costly) the LIMIT or the COUNT.

Thanks for the tip!

Your email will never ever be published

Related posts

django-static version 1.5 automatically taking care of imported CSS 11 January 2011
Next: international visitors 21 January 2011
Related by keywords:
Fastest database for Tornado 09 October 2013
Integer division in programming languages 04 August 2004
Optimization of getting random rows out of a PostgreSQL in Django 23 February 2011
Connecting with psycopg2 without a username and password 24 February 2011
Adding a year in PostgreSQL 04 February 2004
pg_class to check if table exists 20 April 2005
Migrating with South on a field that uses auto_now_add=True 16 December 2009
To sub-select or not sub-select in PostgreSQL 31 August 2009
Speed test between django_mongokit and postgresql_psycopg2 09 March 2010
Date formatting in python or in PostgreSQL 20 July 2004
PostgreSQL, MySQL or SQLite 04 April 2004
UPPER vs. ILIKE 19 April 2010