14 January 2011 5 comments Django
For those familiar with the Django ORM they know how easy it is to work with and that you can do lots of nifty things with the result (QuerySet in Django lingo).
So I was working report that basically just needed to figure out if a particular product has been invoiced. Not for how much or when, just if it's included in an invoice or not.
The code was initially this:
def is_invoiced(self): for __ in Invoice.objects.filter(products=self): return True return False # prefer 'False' to 'None' in this case
Invoice model has an automatic ordering on its date field I thought, that doing a loop would put that ordering into play each time which would suck. A quick way around that is to aggregate instead. First rewrite:
def is_invoiced(self): return bool(Invoice.objects.filter(products=self).count())
If you run the "EXPLAIN ANALYZE ..." on that SQL in PostgreSQL you'll notice that the aggregate causes one extra operation before it goes into doing any other filtering. Hmm... Perhaps I can optimize it even further by just doing a simple select but without the ordering and to optimize it even further I make it return a specific field only since this table has many fields and doing
SELECT * FROM would be a waste of time. All I need is anything,
id will do:
def is_invoiced(self): qs = Invoice.objects.filter(products=self) qs.query.clear_ordering(True) return bool(qs.only('id'))
Right, here's the kicker! I put all of these different "patterns" into simple files
B.sql, etc. Then, I ran these against a database where the relevant table contains about 1,500 rows like this:
$ time for i in $(seq 1 1000); do psql mydatabase < X.sql > /dev/null ; done
This doesn't test the speed of the database or the SQL statement because the overhead is spent on the reading from
stdin and opening the database. So, for each file I copied the statement over 200 lines.
Also, to more simulate my real application, instead of filtering on an indexed primary key I made it into a simple operator on an integer number which yields about 1% of the whole table.
After having run them a bunch of times each and measured and noted their times I can conclude the following result:
* Doing an aggregate on COUNT(*): 100% * Doing a full select without ordering: 200% * Doing a full select with ordering: 200% * Doing a select only on the id without ordering: 100% * Doing a aggregate on COUNT('id'): 100%
What this means is that doing an aggregate takes as long on
* as it does on a primary key. Doing a minimal select without any ordering is as fast as doing the aggregate.
PostgreSQL is a smart cookie! This isn't news to me but it sure proves itself again. It knows to do the right things first and it knows what's needed in the end and runs the query in a very intelligent way.
Going back to the Django aspect of this. Surely, if you run this many many times there's an overhead when the ORM first applies itself to what's received back from the SQL and the
__bool__ operator of the QuerySet completely discards the columns to fields conversion. Just having to get a simple integer back from the database would surely be faster. I would love to have time to dig further into that too one day.