a) count() is much slower than max() or using the sequence currval()
b) if you have the IDs 1,2,4 the count will be 3 so you'd have to run randint(1,3). That way you're never picking up ID=4.
c) what does that slice translate to in SQL? Is it "LIMIT 1 OFFSET "? If so, don't you have to order?
when i write out the function, similar to the using_max function, i get these results (first is using count and slices, second is the using_max function
0.007093 seconds 0.010980 seconds
0.006881 seconds 0.011418 seconds
0.006947 seconds 0.011399 seconds
count = model.objects.all().count() i = 0 while i < TIMES: try: yield model.objects.all()[random.randint(0, count-1)].pk i += 1 except model.DoesNotExist: pass
like i said though, his is on sqlite, with probably not a representative dataset.
Comment
i don't know about a at all, it's something to test, i'm just throwing out an alternate technique to try.
b) it's zero indexed and you should be adding in checks that int(random() * count) < count
c) you're right about offset and limit, but you don't need an order because you're pulling a random one anyway, so it's irrelevant.
Parent comment
a) count() is much slower than max() or using the sequence currval() b) if you have the IDs 1,2,4 the count will be 3 so you'd have to run randint(1,3). That way you're never picking up ID=4. c) what does that slice translate to in SQL? Is it "LIMIT 1 OFFSET"? If so, don't you have to order?
Replies
a) trust me, it is. considerably
c) have you tried it? It's ultra-slow
on c, taking a slice rather than get(pk=X)
i just tried it (this is on sqlite though, which just occurred to me) could someone with a real db test this?
t0 = time()
blah = Model.objects.all()[3].pk
print '%f seconds' % (time() - t0)
t0 = time()
blah = model.objects.model.objects.get(pk=2).pk
print '%f seconds' % (time() - t0)
yielded these three trials
0.000898 seconds
0.001472 seconds
0.001804 seconds
0.001447 seconds
0.001948 seconds
0.001504 seconds
when i write out the function, similar to the using_max function, i get these results (first is using count and slices, second is the using_max function
0.007093 seconds
0.010980 seconds
0.006881 seconds
0.011418 seconds
0.006947 seconds
0.011399 seconds
count = model.objects.all().count()
i = 0
while i < TIMES:
try:
yield model.objects.all()[random.randint(0, count-1)].pk
i += 1
except model.DoesNotExist:
pass
like i said though, his is on sqlite, with probably not a representative dataset.
Do it on a proper database like postgres and copy and paste the SQL that Django generates then run it like this:
mydatabase# EXPLAIN ANALYZE <the big sql statement>;
no need, now that i've filled up the database with faked data, we get this:
0.862994 seconds
0.104061 seconds
0.894336 seconds
0.114008 seconds
0.809722 seconds
0.102276 seconds
Are you sure you're doing that right? I just tried it myself and got this:
COUNT 84482
using_max() took 0.613966941833 seconds
using_max2() took 2.08254098892 seconds
using_count_and_slice() took 14.112842083 seconds
Code here: http://www.peterbe.com/plog/getting-random-rows-postgresql-django/get_random_ones.py
no, i think the numbers jive, and you're right.
it seems to be a 7x factor using the count/slice method i was talking about.
Don't think so, out of the 14 seconds about 0.4 seconds is spent getting the counts.