23 February 2011 48 comments Django
There was a really interesting discussion on the django-users mailing list about how to best select random elements out of a SQL database the most efficient way. I knew using a regular
RANDOM() in SQL can be very slow on big tables but I didn't know by how much. Had to run a quick test!
Cal Leeming discussed a snippet of his to do with pagination huge tables which uses the
MAX(id) aggregate function.
So, I did a little experiment on a table with 84,000 rows in it. Realistic enough to matter even though it's less than millions. So, how long would it take to select 10 random items, 10 times? Benchmark code looks like this:
TIMES = 10 def using_normal_random(model): for i in range(TIMES): yield model.objects.all().order_by('?').pk t0 = time() for i in range(TIMES): list(using_normal_random(SomeLargishModel)) t1 = time() print t1-t0, "seconds"
Nasty!! Also running this you'll notice postgres spiking your CPU like crazy.
A much better approach is to use Python's
random.randint(1, <max ID>). Looks like this:
from django.db.models import Max from random import randint def using_max(model): max_ = model.objects.aggregate(Max('id'))['id__max'] i = 0 while i < TIMES: try: yield model.objects.get(pk=randint(1, max_)).pk i += 1 except model.DoesNotExist: pass t0 = time() for i in range(TIMES): list(using_max(SomeLargishModel)) t1 = time() print t1-t0, "seconds"
Much more pleasant!
Commentator, Ken Swift, asked what if your requirement is to select 100 random items instead of just 10. Won't those 101 database queries be more costly than just 1 query with a
RANDOM(). Answer turns out to be no.
I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were the same:
using_normal_random() took 41.4467599392 seconds using_max() took 0.6027739048 seconds
And what about 1000 items 1 time:
using_normal_random() took 204.685141802 seconds using_max() took 2.49527382851 seconds
The algorithm for returning a generator has a couple of flaws:
Here's a much more complete function:
def random_queryset_elements(qs, number): assert number <= 10000, 'too large' max_pk = qs.aggregate(Max('pk'))['pk__max'] min_pk = qs.aggregate(Min('pk'))['pk__min'] ids = set() while len(ids) < number: next_pk = random.randint(min_pk, max_pk) while next_pk in ids: next_pk = random.randint(min_pk, max_pk) try: found = qs.get(pk=next_pk) ids.add(found.pk) yield found except qs.model.DoesNotExist: pass