Optimization of getting random rows out of a PostgreSQL in Django

Wednesday, Feb 23, 2011

⬅︎ Back to Optimization of getting random rows out of a PostgreSQL in Django

Comment

hutch February 24, 2011

i don't know about a at all, it's something to test, i'm just throwing out an alternate technique to try.

b) it's zero indexed and you should be adding in checks that int(random() * count) < count

c) you're right about offset and limit, but you don't need an order because you're pulling a random one anyway, so it's irrelevant.

Parent comment

Peter Bengtsson February 24, 2011

a) count() is much slower than max() or using the sequence currval() b) if you have the IDs 1,2,4 the count will be 3 so you'd have to run randint(1,3). That way you're never picking up ID=4. c) what does that slice translate to in SQL? Is it "LIMIT 1 OFFSET "? If so, don't you have to order?

Replies

Peter Bengtsson February 24, 2011

a) trust me, it is. considerably

c) have you tried it? It's ultra-slow

hutch February 25, 2011

on c, taking a slice rather than get(pk=X)

i just tried it (this is on sqlite though, which just occurred to me) could someone with a real db test this?

t0 = time()
blah = Model.objects.all()[3].pk
print '%f seconds' % (time() - t0)

t0 = time()
blah = model.objects.model.objects.get(pk=2).pk
print '%f seconds' % (time() - t0)

yielded these three trials

0.000898 seconds
0.001472 seconds

0.001804 seconds
0.001447 seconds

0.001948 seconds
0.001504 seconds

when i write out the function, similar to the using_max function, i get these results (first is using count and slices, second is the using_max function

0.007093 seconds
0.010980 seconds

0.006881 seconds
0.011418 seconds

0.006947 seconds
0.011399 seconds

count = model.objects.all().count()
i = 0
while i < TIMES:
try:
yield model.objects.all()[random.randint(0, count-1)].pk
i += 1
except model.DoesNotExist:
pass

like i said though, his is on sqlite, with probably not a representative dataset.

Peter Bengtsson February 25, 2011

Do it on a proper database like postgres and copy and paste the SQL that Django generates then run it like this:

mydatabase# EXPLAIN ANALYZE <the big sql statement>;

hutch February 25, 2011

no need, now that i've filled up the database with faked data, we get this:

0.862994 seconds
0.104061 seconds

0.894336 seconds
0.114008 seconds

0.809722 seconds
0.102276 seconds

Peter Bengtsson February 25, 2011

Are you sure you're doing that right? I just tried it myself and got this:

COUNT 84482
using_max() took 0.613966941833 seconds
using_max2() took 2.08254098892 seconds
using_count_and_slice() took 14.112842083 seconds

Code here: http://www.peterbe.com/plog/getting-random-rows-postgresql-django/get_random_ones.py

hutch February 25, 2011

no, i think the numbers jive, and you're right.

it seems to be a 7x factor using the count/slice method i was talking about.

Peter Bengtsson February 25, 2011

Don't think so, out of the 14 seconds about 0.4 seconds is spent getting the counts.