Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation" which is SQL standard I didn't even know about.
Before
So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various different time intervals.
First of all, this is how it set up the time intervals:
today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)
And then, for each of these, there's a little function that returns a dict for each time interval:
def count_and_size(qs, start, end):
sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
return {
'count': sub_qs.count(),
'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
}
numbers['uploads'] = {
'today': count_and_size(upload_qs, start_today, today),
'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
'this_month': count_and_size(upload_qs, start_this_month, today),
'this_year': count_and_size(upload_qs, start_this_year, today),
}
What you get is exactly 2 x 4 = 8 queries. One COUNT
and one SUM
for each time interval. E.g.
SELECT SUM("upload_upload"."size") AS "size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... SELECT COUNT(*) AS "__count" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...6 more queries...
Middle
Oops. I think this code comes from a slightly rushed job. We can do the COUNT
and the SUM
at the same time for each query.
# New, improved count_and_size() function!
def count_and_size(qs, start, end):
sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
return sub_qs.aggregate(
count=Count('id'),
total_size=Sum('size'),
)
numbers['uploads'] = {
'today': count_and_size(upload_qs, start_today, today),
'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
'this_month': count_and_size(upload_qs, start_this_month, today),
'this_year': count_and_size(upload_qs, start_this_year, today),
}
Much better, now there's only one query per time bucket. So 4 queries in total. E.g.
SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...3 more queries...
After
But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there's so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:
def make_q(start, end):
return Q(created_at__gte=start, created_at__lt=end)
q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)
aggregates = upload_qs.aggregate(
today_count=Count('pk', filter=q_today),
today_total_size=Sum('size', filter=q_today),
yesterday_count=Count('pk', filter=q_yesterday),
yesterday_total_size=Sum('size', filter=q_yesterday),
this_month_count=Count('pk', filter=q_this_month),
this_month_total_size=Sum('size', filter=q_this_month),
this_year_count=Count('pk', filter=q_this_year),
this_year_total_size=Sum('size', filter=q_this_year),
)
numbers['uploads'] = {
'today': {
'count': aggregates['today_count'],
'total_size': aggregates['today_total_size'],
},
'yesterday': {
'count': aggregates['yesterday_count'],
'total_size': aggregates['yesterday_total_size'],
},
'this_month': {
'count': aggregates['this_month_count'],
'total_size': aggregates['this_month_total_size'],
},
'this_year': {
'count': aggregates['this_year_count'],
'total_size': aggregates['this_year_total_size'],
},
}
Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:
SELECT COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size", COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", ... FROM "upload_upload";
Is this the best thing to do? I'm starting to have my doubts.
Watch Out!
When I take this now 1 monster query for a spin with an EXPLAIN ANALYZE
prefix I notice something worrying!
QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Aggregate (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1) -> Seq Scan on upload_upload (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1) Planning time: 0.427 ms Execution time: 0.674 ms (4 rows)
A sequential scan! That's terrible. The created_at
column is indexed in a BTREE
so why can't it use the index.
The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked.
So what did I do? I went back to the "middle" solution. One SELECT
query per time bucket. So 4 queries in total, but at least all 4 is able to use an index.
Comments
Hi!
The reason it may not be using the index is the query planner may have determined that the size of your data set is such that it is faster to perform a sequential scan than an index scan. It looks like in your data set there are only 813 rows, and if you look at the execution time, it's < 1ms, so the sequential scan is doing pretty well in this case ;-)
If you try running it with several tens of thousands of roles, I would bet it would perform an index scan.
Nice blog post!
See my stackoverflow question: https://stackoverflow.com/questions/48233665/can-aggregate-filter-expressions-not-use-indices
In that table (which has 4.6M rows) and trying to select a subset of about 10K rows, the total running time is repeatedly about 1.5s whereas the more traditional query takes 0.005s.
I wonder if the lack of fields in the main query and only fields in subqueries is producing a suboptimal optimization. Does anything change if you add a where upload.created_at > start_of_year to the outer select?
have the same assumption, dividing into 2 sets might yield better results (select count,sum from yesterday UNION ALL select ... from today). Are exact values needed? then you could get rid of aggregating kbytes