There was a really interesting discussion on the django-users mailing list about the most efficient way to select random elements out of a SQL database. I knew that using a regular RANDOM() in SQL can be very slow on big tables, but I didn't know by how much. I had to run a quick test!
Cal Leeming discussed a snippet of his for paginating huge tables which uses the MAX(id) aggregate function.
So I did a little experiment on a table with 84,000 rows in it: realistic enough to matter, even though it's less than millions. How long would it take to select 10 random items, 10 times? The benchmark code looks like this:
from time import time

TIMES = 10

def using_normal_random(model):
    # order_by('?') becomes ORDER BY RANDOM() in the SQL
    for i in range(TIMES):
        yield model.objects.all().order_by('?')[0].pk

t0 = time()
for i in range(TIMES):
    list(using_normal_random(SomeLargishModel))
t1 = time()
print t1-t0, "seconds"
Result:
41.8955321312 seconds
Nasty!! Also, running this you'll notice Postgres spiking your CPU like crazy.
A much better approach is to use Python's random.randint(1, <max ID>). It looks like this:
from django.db.models import Max
from random import randint

def using_max(model):
    max_ = model.objects.aggregate(Max('id'))['id__max']
    i = 0
    while i < TIMES:
        try:
            yield model.objects.get(pk=randint(1, max_)).pk
            i += 1
        except model.DoesNotExist:
            pass

t0 = time()
for i in range(TIMES):
    list(using_max(SomeLargishModel))
t1 = time()
print t1-t0, "seconds"
Result:
0.63835811615 seconds
Much more pleasant!
UPDATE
Commenter Ken Swift asked: what if your requirement is to select 100 random items instead of just 10? Won't those 101 database queries be more costly than just one query with RANDOM()? The answer turns out to be no.
I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were the same:
using_normal_random() took 41.4467599392 seconds
using_max() took 0.6027739048 seconds
And what about 1000 items 1 time:
using_normal_random() took 204.685141802 seconds
using_max() took 2.49527382851 seconds
UPDATE 2
The algorithm for returning a generator has a few flaws:
- You can't pass in a QuerySet
- You get primary keys back, not ORM instances
- You can't pass in how many items you want
- Internally, it might randomly select a number it has already tried
Here's a much more complete function:
import random
from django.db.models import Max, Min

def random_queryset_elements(qs, number):
    assert number <= 10000, 'too large'
    max_pk = qs.aggregate(Max('pk'))['pk__max']
    min_pk = qs.aggregate(Min('pk'))['pk__min']
    ids = set()
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        while next_pk in ids:
            next_pk = random.randint(min_pk, max_pk)
        try:
            found = qs.get(pk=next_pk)
            ids.add(found.pk)
            yield found
        except qs.model.DoesNotExist:
            pass
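For example, a quick usage sketch (assuming the same SomeLargishModel as in the benchmarks above):

# Hypothetical usage; any queryset will do.
ten_random_instances = list(
    random_queryset_elements(SomeLargishModel.objects.all(), 10)
)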
Comments
I have used a similar approach but sometimes it can fail if rows have been deleted. Thus you should only use this method if you know that all of the IDs from 1 to 'max' still exist (as a side point, if you do know they all exist, a count may be faster than a Max aggregate).
Anyway, to get around this I used something like:
model.objects.filter(id__gte=randint(1, max_))[0]
If you get a random ID that doesn't exist, the loop will just try another number until it has rounded up as many as it needs.
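For completeness, a rough sketch of what that id__gte variant might look like as a full generator (the function name and the explicit order_by('id') are my own additions, not from the comment):

from random import randint
from django.db.models import Max

def using_gte_filter(model, times=10):
    # Because the filter is id >= some random number, a deleted ID simply
    # falls through to the next existing row instead of raising DoesNotExist.
    max_ = model.objects.aggregate(Max('id'))['id__max']
    for _ in range(times):
        yield model.objects.filter(id__gte=randint(1, max_)).order_by('id')[0].pk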
A count is never faster than a MAX aggregate. Certainly not in Postgres.
Well, your solution is bugged :) You can't use max(id) since some records could have been deleted, so your randint(1, max_) would produce IDs that do not exist.
But of course there is a bulletproof solution. Just add a column to the table, say "number", and write a trigger which will maintain ascending int values with no gaps. Finally, use randint(1, max(number_column)) to get random rows.
It's not bugged. See what I wrote to Dougal. It re-tries if it doesn't exist.
The assumption is that the count is quite similar to the max.
Ah, my mistake. But still, let's assume you want to get 100 rows in random order. With your solution the script would execute more than 100 queries against the database. It would be slow. So why are you so reluctant to use triggers, which guarantee you get what you want with one query?
Off topic: I am not a big fan of ORMs, and I am surprised that the Python/Django community tries to solve database problems with weird Python/Django ORM solutions rather than using the database itself, which results in greater performance and readability.
Check the update I just added to the blog post. Selecting 100 random items only once takes the same amount of time.
How would one use triggers to solve this in a better way?
CREATE OR REPLACE FUNCTION position_update() RETURNS TRIGGER AS $$
DECLARE
    next_position INTEGER;
BEGIN
    IF (TG_OP = 'INSERT') THEN
        SELECT (COALESCE(MAX(position), 0) + 1) INTO next_position FROM table;
        UPDATE table SET position = next_position WHERE id = NEW.id;
    ELSEIF (TG_OP = 'DELETE') THEN
        UPDATE table SET position = position - 1 WHERE position > OLD.position;
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE 'plpgsql' VOLATILE
COST 100;

DROP TRIGGER IF EXISTS position_trigger ON table;
CREATE TRIGGER position_trigger AFTER INSERT OR DELETE ON table FOR EACH ROW EXECUTE PROCEDURE position_update();
Now we can:
1. Select max(position) from the table
2. Generate random numbers from 1 to that max
3. Fetch all the rows in one query using WHERE ... IN
On my desktop, the time to fetch 100 random rows from a 100,000-row table is... ~40ms.
That's a difference, isn't it?
I have no idea what that does. How do you use it to select random elements?
1. Create the trigger I've posted
2. Select the maximum value of the "position" column
3. Generate your random numbers varying from 1 to the value from point 2
4. Select rows from the table using the WHERE predicate:
"where position in (generated numbers from point 3)"
I don't know the Django ORM so well, so I can't translate it to Python code :) But I assume you can use a "where in" predicate in it.
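For what it's worth, a rough sketch of how that might translate to the Django ORM, assuming a model with a gap-free position column maintained by such a trigger (the field name and function are hypothetical):

from random import sample
from django.db.models import Max

def random_by_position(model, number):
    # Assumes a "position" column with no gaps, kept up to date by the trigger.
    max_position = model.objects.aggregate(Max('position'))['position__max']
    positions = sample(range(1, max_position + 1), number)
    # One query: ... WHERE position IN (...)
    return model.objects.filter(position__in=positions)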
Again, I'm feeling really dim. What do you mean by "Select maximum value of "position" column"??
In python, to do `random.randint(1, SOME_BIG_NUMBER)` is very quick and that's not where the optimization win is.
I meant: select max(position) from table;
But I see my solution is not clear, so I will explain once more.
We have a table with rows and an id column as the pk. If no rows were ever deleted, we would have id values varying from 1 to the number of rows, so max(id) gives us the number of rows. However, if some rows were deleted, then max(id) != the number of rows, hence we can't use random.randint(1, max_id_value_returned_from_database) because there are gaps in the id sequence. But if we add an additional column, say "position", like in my example, and add a trigger to that table, we can maintain this column's value in the following fashion:
1. If a row is added, select the maximum existing value of the "position" column, increment it by one and save it to the new record.
2. If a row is deleted, update all rows that have a "position" value bigger than the deleted row's, reducing their "position" value by 1. That is how I enforce a sequence with no gaps in it.
Now that we have a sequence with no gaps, we can get the maximum value of this sequence, generate some random numbers in Python, and retrieve all the rows with one query. We could also add an index on that column, but since we are retrieving random rows, random fetches from the database are inevitable anyway.
Some gaps are OK. Read my code again. If COUNT(id) is 10,000 and MAX(id) is 11,000, that means you have about 10% blanks in there, so the loop that keeps going until it has fetched 'TIMES' random entries will have to loop about 11 (10 + 10% of 10) times.
If it's the kind of application where deletes are very common, look at Simon's solution.
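To make that arithmetic concrete, a quick back-of-the-envelope sketch using the numbers above (nothing here is measured, it's just the expectation):

count_ids = 10000  # COUNT(id)
max_id = 11000     # MAX(id)
hit_rate = float(count_ids) / max_id  # chance a random ID still exists, ~0.91
expected_attempts = 10 / hit_rate     # ~11 attempts to land 10 existing rows
print expected_attempts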
Actually, your algorithm is broken. Suppose you add:
row1, row2, row3, row4.
The max "position" value will be 4 (randint(1, 4)).
Then you delete row3. The max "position" value will be decremented to 3 (randint(1, 3)). How then will it ever randomly pick row4?
Read my post again. If you delete row3, the value of "position" for row4 will be decremented by 1, so after the delete it will have a value of 3. This leads us to a place where everything is all right.
I see! So the position is kept with the table.
But now you need an index on the position field too otherwise you can't pick from it quickly. Seems tight but extremely convoluted.
"On my desktop time to fetch 100random rows from 100000row table is.... ~40ms"
This was done without index and worked perfectly fine. Also, I'am glad that I finally make myself clear about this solution :)
So if I understand, you're gaining some optimization for the random SELECT but you're causing an UPDATE on a possibly large number of rows for every DELETE? Example: if you DELETE the first row in the table, you now have to UPDATE every other row. Seems like this would be a high cost to pay, especially if your app does frequent deletes.
You are correct. There never is a golden solution for every case. All tables have their specifics, and where one solution is great, it is a failure in another.
Rather than decrementing the position of every row above the deleted one, you should just give the highest one the position of the deleted one. So if you have rows 1, 2, 3, 4 and you delete 2, you have 1, 3, 2. That way you only have to make one update.
I don't think that would work, i.e. we've got 1,2,3,4. We delete 2.
1,3,4 are left and we are missing 2. Following your advice I can update the last value (4) and get 1,3,2.
Now I delete 1. 3,2 are left and updating the last value won't give us a correct sequence.
No, you would replace the 3 with 1. That is the highest position value in the sequence 1,3,2. Not the *last* value in position, the *highest*.
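A minimal sketch of that swap idea at the application level rather than in a trigger (the position field and the helper name are hypothetical, purely to illustrate the single-update delete):

def delete_keeping_positions_dense(instance):
    # Instead of shifting every row above the deleted one, move the row
    # with the highest position into the freed-up slot: one UPDATE per DELETE.
    model = instance.__class__
    freed_position = instance.position
    instance.delete()
    remaining = list(model.objects.order_by('-position')[:1])
    if remaining and remaining[0].position > freed_position:
        highest = remaining[0]
        highest.position = freed_position
        highest.save()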
@offtopic: the reason for using an ORM is not that we are afraid of raw SQL (though you should be sometimes...), but the power of the model and queryset classes. We can manipulate models and querysets in a Pythonic way, and reuse all kinds of stuff on different models and queries. A queryset can be used to just get a list of instances, filter the change list in the admin, provide choices for a dropdown box, be used in all kinds of templatetags etc.
Raw SQL queries have a place in django applications, but there are very good reasons to stick with the ORM most of the time...
I totally agree. Not depending on SQL means I can run unit tests in SQLite3 in memory and then manually test and deploy on Postgres.
Staying within the ORM usually means you're not complicating things for yourself. Instead of solving problems by abandoning the ORM and digging deep into the SQL perhaps instead you should walk around the problem and look at it in a different way.
Well, I know what an ORM is used for, but I often see that people use it too much. What you are saying is all fine and dandy when you've got simple business logic like getting a list of something, simple predicates and stuff. But when you need to join 10 tables in a sophisticated manner, ORMs are useless :)
You can't expect that some code will do everything for you, and even if it does, don't expect it will be the best available way to achieve your goal.
Probably I am lucky that I never needed to join 10 tables in a single query. If these questions come up in IRC, most of the time, whoever asked it was thinking much much too complicated.
If however, I really had to join 10 tables, punching out a complex ORM queryset would be the least of my problems. But again, I still expect such cases to be a very small minority.
Most problems I ever encountered map nicely to the ORM if you put a little thought in it.
I guess my point was that, if you need to join 10 tables you probably have made the app too complex. To get out of it, an ORM certainly won't help.
By the way, a good ORM makes it possible to inject arbitrary raw SQL if there's a need for speed or a need for a hack.
Or I had to develop a big app with weird client requests :)
For the second part, about hacking the ORM: I don't know why using some fancy database features like CTEs, window functions, SQL functions, views or database extensions is "hacking". Mixing raw SQL with the ORM is, IMHO, in bad taste too; once again, what's the point? If I need data only for displaying (I do not want to change that data), why would I go through the pain of making a complex ORM query if I can write a nice and easily readable SQL query that uses 100% of the database? Again, if I want to make an efficient application, limiting myself only to the features the ORM provides is wrong. Why would I forbid myself from using all the goodies Postgres has? Or if I was developing with MySQL, Oracle etc., why should I limit myself only to standard SQL queries?
On the other hand, I have no grudge against ORMs. I use them every day, but for tasks they fit perfectly. I've seen many blog posts presenting Pythonic solutions for problems, taking tens of lines, which could have been done in a 10-line trigger...
Why? Because it reduces complexity from a code-reading point of view.
Compare the readability of your trigger function with a snippet of plain python ORM code. A new developer joining would have to study the SQL functions, the triggers, the application code.
Large SQL statements are harder to read and harder to debug. You can't put in step-by-step debuggers, debugging print statements, monkeypatching or unit testing.
That's why.
By the way, I maintain a very large web app where the SQL is handcoded (it started before ORMs became popular). Hundreds of .sql files. It's fast because the database gets to do so much of the processing. But for basic things (editing a user profile) it's tedious to have to type all the SQL, and the app now only works in Postgres, meaning I can't run my unit tests in an in-memory SQLite3 database.
Our discussion is going nowhere. No offence, but if SQL statements are hard to read and hard to debug, don't use SQL at all if it's too hard to learn :) The only reason people are so afraid of using SQL is that they don't know it very well, so using an ORM is very cosy.
I know SQL extremely well but I still believe ORMs are generally better than hand-coding SQL and using triggers, functions and views. Especially if it's an app.
I'm a big fan of Redis for solving this kind of problem: stick your database primary keys in a Redis set, then use the lightning-fast SRANDMEMBER command to grab a random ID. Keeping your Redis sets synchronized with your database IDs isn't a huge burden. I usually use a bit of custom code in the .save() and .delete() methods on the Django model, but you could also use Django's signals.
Wow! That's interesting. Considering how fast a primary key index lookup is, Redis still manages to beat it. Thanks for the tip!
BTW, check the update I added to the blog post about selecting 100 random items instead of 10 items ten times.
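A minimal sketch of Simon's Redis idea, assuming redis-py, Django signals and the SomeLargishModel from the benchmarks above (the key name and the wiring are my own choices, purely illustrative):

import redis
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

r = redis.StrictRedis()
KEY = 'somelargishmodel:pks'  # hypothetical key name

@receiver(post_save, sender=SomeLargishModel)
def add_pk_to_set(sender, instance, **kwargs):
    r.sadd(KEY, instance.pk)

@receiver(post_delete, sender=SomeLargishModel)
def remove_pk_from_set(sender, instance, **kwargs):
    r.srem(KEY, instance.pk)

def random_instance():
    # SRANDMEMBER returns a random member of the set in O(1).
    return SomeLargishModel.objects.get(pk=r.srandmember(KEY))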
Can't you just take the count of your queryset, and then pick random rows via slicing?
Count is slower than Max. Also, a count will yield a lower number than MAX(id), meaning the random range is reduced. With slicing, how would you make it random?
Example:
count = Model.objects.all().count()
randomrow = Model.objects.all()[int(random.random() * count)]
The idea is that since querysets are lazy, only that one row will actually be retrieved.
The reason we want to use count() instead of max is that the max pk may be significantly higher than the total number of rows available to slice out, i.e. rows have been deleted.
If you wanted 10 random rows, you could just generate 10 random numbers, making sure they're unique, and then iterate over the list of randoms, taking that row number from the queryset and appending it to a list or something.
I actually think I first saw this technique in the docs, as a way to avoid MySQL's slow randomizing.
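A rough sketch of that count-and-slice idea for several rows, again assuming SomeLargishModel (the function name is mine):

import random

def using_count_and_sample(model, number=10):
    # One COUNT query, then one LIMIT 1 OFFSET <n> query per unique random offset.
    count = model.objects.count()
    offsets = random.sample(range(count), number)  # unique, zero-indexed
    return [model.objects.all()[offset] for offset in offsets]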
a) count() is much slower than max() or using the sequence currval()
b) if you have the IDs 1,2,4 the count will be 3 so you'd have to run randint(1,3). That way you're never picking up ID=4.
c) what does that slice translate to in SQL? Is it "LIMIT 1 OFFSET <python random number>"? If so, don't you have to order?
I don't know about a) at all, it's something to test; I'm just throwing out an alternate technique to try.
b) It's zero-indexed and you should be adding in checks that int(random() * count) < count.
c) You're right about offset and limit, but you don't need an order because you're pulling a random one anyway, so it's irrelevant.
a) Trust me, it is. Considerably.
c) Have you tried it? It's ultra-slow.
On c), taking a slice rather than get(pk=X): I just tried it (this is on SQLite though, which just occurred to me); could someone with a real db test this?

t0 = time()
blah = Model.objects.all()[3].pk
print '%f seconds' % (time() - t0)

t0 = time()
blah = Model.objects.get(pk=2).pk
print '%f seconds' % (time() - t0)
That yielded these three trials (slice timing first, then get, in each pair):
0.000898 seconds
0.001472 seconds
0.001804 seconds
0.001447 seconds
0.001948 seconds
0.001504 seconds
When I write out the function, similar to the using_max function, I get these results (first is using count and slices, second is the using_max function):
0.007093 seconds
0.010980 seconds
0.006881 seconds
0.011418 seconds
0.006947 seconds
0.011399 seconds
import random

def using_count_and_slice(model):
    count = model.objects.all().count()
    i = 0
    while i < TIMES:
        try:
            yield model.objects.all()[random.randint(0, count-1)].pk
            i += 1
        except model.DoesNotExist:
            pass
Like I said though, this is on SQLite, with probably not a representative dataset.
Do it on a proper database like postgres and copy and paste the SQL that Django generates then run it like this:
mydatabase# EXPLAIN ANALYZE <the big sql statement>;
No need; now that I've filled up the database with faked data, we get this:
0.862994 seconds
0.104061 seconds
0.894336 seconds
0.114008 seconds
0.809722 seconds
0.102276 seconds
Are you sure you're doing that right? I just tried it myself and got this:
COUNT 84482
using_max() took 0.613966941833 seconds
using_max2() took 2.08254098892 seconds
using_count_and_slice() took 14.112842083 seconds
Code here: http://www.peterbe.com/plog/getting-random-rows-postgresql-django/get_random_ones.py
No, I think the numbers jive, and you're right.
It seems to be a 7x factor using the count/slice method I was talking about.
Don't think so, out of the 14 seconds about 0.4 seconds is spent getting the counts.
Really nice post, but what is the business value of this? It reminds me of the question "how can I count 8 million rows in a table?". And the answer is not "Slow", it is "Why do you have to?".
Taking random entries from a million-row table will not only kill your server, it will be bad for your users' experience. Do you like sites where the content changes on every refresh? Sounds to me like "Hey, wait for me to refresh this page 1000 times to show you something great".
E.g. you're building a customer database and you want cold callers to ring randomly selected telephone numbers.
I am not convinced it must be done at random, but maybe you're right.
But my approach would be to take them in order and record who I called and whether this led to a purchase/contract. The chance of finding returning customers is much bigger than convincing a new one if you have some reputation (let's say 100,000 clients). In the other case, when you have fewer customers and you are trying to find new ones, doesn't calling them randomly have the same value as doing it in order?
I was just thinking, why not use "in_bulk" to get the random IDs? This could reduce the number of requests to the DB significantly.
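For reference, a rough sketch of what that in_bulk idea could look like; it still needs retry logic, because IDs that no longer exist are simply absent from the dict that in_bulk returns (the function name is hypothetical):

from random import randint
from django.db.models import Max

def random_with_in_bulk(model, number):
    max_ = model.objects.aggregate(Max('id'))['id__max']
    found = {}
    while len(found) < number:
        # Ask for as many candidates as are still missing; IDs that were
        # deleted (or drawn twice) simply don't come back from in_bulk.
        candidates = [randint(1, max_) for _ in range(number - len(found))]
        found.update(model.objects.in_bulk(candidates))
    return list(found.values())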