For those familiar with the Django ORM they know how easy it is to work with and that you can do lots of nifty things with the result (QuerySet in Django lingo).

So I was working report that basically just needed to figure out if a particular product has been invoiced. Not for how much or when, just if it's included in an invoice or not.

The code was initially this:

def is_invoiced(self):
   for __ in Invoice.objects.filter(products=self):
      return True
    return False # prefer 'False' to 'None' in this case

Since the Invoice model has an automatic ordering on its date field I thought, that doing a loop would put that ordering into play each time which would suck. A quick way around that is to aggregate instead. First rewrite:

def is_invoiced(self):
   return bool(Invoice.objects.filter(products=self).count())

If you run the "EXPLAIN ANALYZE ..." on that SQL in PostgreSQL you'll notice that the aggregate causes one extra operation before it goes into doing any other filtering. Hmm... Perhaps I can optimize it even further by just doing a simple select but without the ordering and to optimize it even further I make it return a specific field only since this table has many fields and doing SELECT * FROM would be a waste of time. All I need is anything, id will do:

def is_invoiced(self):
    qs = Invoice.objects.filter(products=self)
    qs.query.clear_ordering(True)
    return bool(qs.only('id'))

Right, here's the kicker! I put all of these different "patterns" into simple files A.sql, B.sql, etc. Then, I ran these against a database where the relevant table contains about 1,500 rows like this:

$ time for i in $(seq 1 1000); do psql mydatabase < X.sql > /dev/null ; done

This doesn't test the speed of the database or the SQL statement because the overhead is spent on the reading from stdin and opening the database. So, for each file I copied the statement over 200 lines.

Also, to more simulate my real application, instead of filtering on an indexed primary key I made it into a simple operator on an integer number which yields about 1% of the whole table.

After having run them a bunch of times each and measured and noted their times I can conclude the following result:

* Doing an aggregate on COUNT(*):  100%
* Doing a full select without ordering: 200%
* Doing a full select with ordering: 200%
* Doing a select only on the id without ordering: 100%
* Doing a aggregate on COUNT('id'): 100% 

What this means is that doing an aggregate takes as long on * as it does on a primary key. Doing a minimal select without any ordering is as fast as doing the aggregate.

PostgreSQL is a smart cookie! This isn't news to me but it sure proves itself again. It knows to do the right things first and it knows what's needed in the end and runs the query in a very intelligent way.

Going back to the Django aspect of this. Surely, if you run this many many times there's an overhead when the ORM first applies itself to what's received back from the SQL and the __bool__ operator of the QuerySet completely discards the columns to fields conversion. Just having to get a simple integer back from the database would surely be faster. I would love to have time to dig further into that too one day.

Patrys - 14 January 2011 [«« Reply to this]
What about calling qs.exists()?
Peter Bengtsson - 14 January 2011 [«« Reply to this]
I didn't know about .exists() aaarrh!!
Got to immediately check it out.
Matthew Scott - 15 January 2011 [«« Reply to this]
I second the "aaarrh!!"

There's a find-and-replace session in my own near future as well. :)
Luke Plant - 14 January 2011 [«« Reply to this]
Out of interest, how does these methods compare with QuerySet.exists()? That method is supposed to be fast, but I'm guessing from the Postgres side it might do effectively the same amount of work.
Peter Bengtsson - 14 January 2011 [«« Reply to this]
First of all, I didn't know about .exists(). That's great! I think syntactically it looks better than converting a count integer to a boolean.

In terms of SQL query complexity it's got the same complexity as a count. I'd need a larger set and a more realistic query to find out which is most cost effective (or rather, least costly) the LIMIT or the COUNT.

Thanks for the tip!


Your email will never ever be published