Optimization of getting random rows out of a PostgreSQL in Django

Wednesday, Feb 23, 2011

⬅︎ Back to Optimization of getting random rows out of a PostgreSQL in Django

Comment

Peter Bengtsson February 23, 2011

Actually, your algorithm is broken. Suppose you add:
row1, row2, row3, row4.
The "position" value will be 4 (randint(1, 4))
Then you delete row3. The "position" value will be decrement to 3 (randint(1, 3)). How then will it ever randomly pick row4?

Parent comment

Ken Swift February 23, 2011

I've meant: select max(position) from table; But I see my solution is not clear so i will explain once more. We have table with rows and id column as pk. If no rows were ever deleted then we would have id values varying from 1 to number of rows. So max(id) gives us number of rows. However if some rows were deleted then max(id) != rows number, hence we cant use random.randint(1, max_id_value_returned_from_database) because there are some gaps in id sequence. But if we add additional column say "position" like in my example and add trigger to that table we could maintain this column value in given fashion: 1. if row is added, select maximum existing value for column position. Then increment it by one and save it to the new record. 2. if row is deleted then update all rows that have "position" value bigger then deleted row. All updated rows will reduce "position" value by 1. That is how i enforce sequence with no gaps in it. Now that we have sequence with no gaps we can get maximum value of this sequence, generate some random number in python, and retrieve all rows with one query. Also we could add index for that column, but since we are retrieving randoms rows random fetches for database are inevitable.

Replies

Ken Swift February 24, 2011

Read my post again. If you delete row3, value of "position" for row4 will be decremented by 1, so, after delete it will have value of 3. This leads us to place where everything is alright.

Peter Bengtsson February 24, 2011

I see! So the position is kept with the table.

But now you need an index on the position field too otherwise you can't pick from it quickly. Seems tight but extremely convoluted.

Ken Swift February 24, 2011

"On my desktop time to fetch 100random rows from 100000row table is.... ~40ms"

This was done without index and worked perfectly fine. Also, I'am glad that I finally make myself clear about this solution :)

Jay Wineinger February 24, 2011

So if I understand, you're gaining some optimization for the random SELECT but you're causing an UPDATE on a possibly large number of rows for every DELETE? Example: if you DELETE the first row in the table, you now have to UPDATE every other row. Seems like this would be a high cost to pay, especially if your app does frequent deletes.

Ken Swift February 24, 2011

You are correct. There never is golden solution for each case. All tables have theirs specifics and where on solutions is great it is a failure in another one.

Nathan Florea March 19, 2011

Rather than decrementing the position of every row above the deleted one, you should just give the highest one the position of the deleted one. So if you have rows 1, 2, 3, 4 and you delete 2, you have 1, 3, 2. That way you only have to make one update.

Ken Swift March 19, 2011

I dont think that would work, ie. we got 1,2,3,4. We delete 2
1,3,4 are left and we are missing 2. Flowing ur advice i can update last value (4) and get 1,3,2.

Now, I delete 1. 3,2 are left and updating last value wont give us a correct sequence

Nathan Florea March 19, 2011

No, you would replace the 3 with 1. That is the highest position value in the sequence 1,3,2. Not the *last* value in position, the *highest*.