Best EXPLAIN ANALYZE benchmark script
April 19, 2018
0 comments Python, PostgreSQL
tl;dr; Use best-explain-analyze.py
to benchmark a SQL query in Postgres.
I often benchmark SQL by extracting the relevant SQL string, prefixing it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)
Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really just want to look at the Execution time milliseconds number. Especially if you have two slightly different SQL queries to compare and contrast.
However, as you might have noticed, the Execution time number varies between runs. You might think nothing's changed, but Postgres might have warmed up some internal caches, or your host might be more or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!
best-explain-analyze.py
Check this out: best-explain-analyze.py
Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!
Now, when you run it, it runs the query 10 times (by default) and reports the best Execution time, its mean and its median. Example output:
▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms
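For context, here's a minimal sketch of what a script like that could do (this is a reconstruction for illustration, not the actual best-explain-analyze.py; it assumes the .sql file already starts with EXPLAIN ANALYZE and that psql is on your $PATH):

#!/usr/bin/env python
# Minimal sketch (not the real script): run the query file N times with psql
# and report the best/mean/median "Execution time" and "Planning time".
import re
import statistics
import subprocess
import sys


def run_once(database, sql_file):
    output = subprocess.check_output(['psql', database, '-f', sql_file]).decode('utf-8')
    execution = float(re.search(r'Execution time: ([\d.]+) ms', output).group(1))
    planning = float(re.search(r'Planning time: ([\d.]+) ms', output).group(1))
    return execution, planning


def main(database, sql_file, times=10):
    executions, plannings = [], []
    for _ in range(times):
        execution, planning = run_once(database, sql_file)
        executions.append(execution)
        plannings.append(planning)
    for label, numbers in (('EXECUTION TIME', executions), ('PLANNING TIME', plannings)):
        print(label)
        print('    BEST    {:.3f}ms'.format(min(numbers)))
        print('    MEAN    {:.3f}ms'.format(statistics.mean(numbers)))
        print('    MEDIAN  {:.3f}ms'.format(statistics.median(numbers)))


if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])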
The "BEST" is an important metric. More important than mean or median.
Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:
"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."
Efficient many-to-many field lookup in Django REST Framework
April 11, 2018
1 comment Python, Django, PostgreSQL
The basic setup
Suppose you have these models:
from django.db import models
class Category(models.Model):
name = models.CharField(max_length=100)
class Blogpost(models.Model):
title = models.CharField(max_length=100)
categories = models.ManyToManyField(Category)
Suppose you hook these up with Django REST Framework and list all Blogpost items. Something like this:
# urls.py
from rest_framework import routers
from . import views
router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets
class BlogpostViewSet(viewsets.ModelViewSet):
queryset = Blogpost.objects.all().order_by('date')
serializer_class = serializers.BlogpostSerializer
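The post doesn't show the serializer, but for the sake of illustration it could plausibly look something like this (hypothetical; the field choices are an assumption):

# serializers.py (hypothetical; not shown in the post)
from rest_framework import serializers

from .models import Blogpost


class BlogpostSerializer(serializers.ModelSerializer):
    # Rendering the categories by name is what triggers the per-row queries below.
    categories = serializers.SlugRelatedField(many=True, read_only=True, slug_field='name')

    class Meta:
        model = Blogpost
        fields = ('id', 'title', 'categories')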
What's the problem?
Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/
) what will happen, on the database, is something like this:
SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...
Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogpost records, for each and every one, it needs to make a query to the many-to-many table (app_blogpost_categories in this example).
That's not going to be performant. In fact, it might be dangerous for your database if the queryset of blogposts gets big, like requesting 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap-ish, but doing 1,000 SELECTs with a JOIN is never going to be cheap. It adds up horribly.
How you solve it
The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost
table and 1 query on the app_category
table.
First you have to override the ViewSet.list
method. Then, in there you can do exactly what you need.
Here's the framework for this change:
# views.py
from rest_framework import viewsets
class BlogpostViewSet(viewsets.ModelViewSet):
# queryset = Blogpost.objects.all().order_by('date')
serializer_class = serializers.BlogpostSerializer
def get_queryset(self):
# Chances are, you're doing something more advanced here
# like filtering.
return Blogpost.objects.all().order_by('date')
def list(self, request, *args, **kwargs):
response = super().list(request, *args, **kwargs)
# Where the magic happens!
return response
Next, we need to make a mapping of all Category.id -1-> Category.name. But we want to make sure we only do this for the categories that are involved in the Blogpost records that matter. You could do something like this:
category_names = {}
for category in Category.objects.all():
category_names[category.id] = category.name
But to avoid looking up category names you never need, use the queryset on Blogpost. I.e.
qs = self.get_queryset()
all_categories = Category.objects.filter(
id__in=Blogpost.categories.through.objects.filter(
blogpost__in=qs
).values('category_id')
)
category_names = {}
for category in all_categories:
category_names[category.id] = category.name
Now you have a dictionary mapping all the Category IDs that matter to their names.
Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant where you look them up simply, without any fancy joins.
Next is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings). Like this:
from collections import defaultdict

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
categories_map[m2m.blogpost_id].append(
category_names[m2m.category_id]
)
So what we have now is a dictionary whose keys are the IDs of the blogposts in self.get_queryset() and whose values are lists of strings, e.g. ['Category X', 'Category Z'].
Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:
for each in response.data:
each['categories'] = categories_map.get(each['id'], [])
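So, assuming the serializer includes the blogpost id, a single item in response.data might end up looking something like this (made-up values):

# Hypothetical example of one patched item in response.data
{
    'id': 1025,
    'title': 'Some blog post title',
    'categories': ['Category X', 'Category Z'],
}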
The whole solution looks something like this:
# views.py
from collections import defaultdict

from rest_framework import viewsets
class BlogpostViewSet(viewsets.ModelViewSet):
# queryset = Blogpost.objects.all().order_by('date')
serializer_class = serializers.BlogpostSerializer
def get_queryset(self):
# Chances are, you're doing something more advanced here
# like filtering.
return Blogpost.objects.all().order_by('date')
def list(self, request, *args, **kwargs):
response = super().list(request, *args, **kwargs)
qs = self.get_queryset()
all_categories = Category.objects.filter(
id__in=Blogpost.categories.through.objects.filter(
blogpost__in=qs
).values('category_id')
)
category_names = {}
for category in all_categories:
category_names[category.id] = category.name
categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
categories_map[m2m.blogpost_id].append(
category_names[m2m.category_id]
)
for each in response.data:
each['categories'] = categories_map.get(each['id'], [])
return response
It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c)
is better than O(n)
.
Discussion
Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.
Or, if you use pagination very conservatively and only allow like 10 items per page, then it won't be so expensive to do one query for every many-to-many field.
Show size of every PostgreSQL database you have
February 7, 2018
0 comments PostgreSQL
tl;dr; SELECT pg_database.datname, pg_database_size(pg_database.datname), pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database ORDER by 2 DESC;
I recently had to transfer all the Postgres data for my local databases on the laptop. Although I, unfortunately, can't remember what it was before I deleted a bunch of databases, after some clean-up all my PostgreSQL databases now weigh 12GB.
To find out the size of all your databases, start psql like this:
$ psql postgres
Then run this:
SELECT pg_database.datname,
pg_database_size(pg_database.datname),
pg_size_pretty(pg_database_size(pg_database.datname))
FROM pg_database ORDER by 2 DESC;
Here's what it looked like on my laptop:
postgres=# SELECT pg_database.datname,
postgres-#        pg_database_size(pg_database.datname),
postgres-#        pg_size_pretty(pg_database_size(pg_database.datname))
postgres-# FROM pg_database ORDER by 2 DESC;
         datname          | pg_database_size | pg_size_pretty
--------------------------+------------------+----------------
 songsearch               |      10689224876 | 10194 MB
 air_mozilla_org          |        355639812 | 339 MB
 kl                       |        239297028 | 228 MB
 kl2                      |        239256068 | 228 MB
 peterbecom               |        191914500 | 183 MB
 kintobench               |        125968556 | 120 MB
 airmozilla               |         41640452 | 40 MB
 socorro_webapp           |         32530948 | 31 MB
 tecken                   |         26706092 | 25 MB
 dailycookie              |         12935684 | 12 MB
 autocompeter             |         12313092 | 12 MB
 socorro_integration_test |         11428356 | 11 MB
 breakpad                 |         11313668 | 11 MB
 test_peterbecom          |          9298436 | 9081 kB
 battleshits              |          9028100 | 8817 kB
 thuawood2                |          8716804 | 8513 kB
 thuawood                 |          8667652 | 8465 kB
 fastestdb                |          8012292 | 7825 kB
 premailer                |          7676420 | 7497 kB
 crontabber               |          7586308 | 7409 kB
 postgres                 |          7536812 | 7360 kB
 crontabber_exampleapp    |          7488004 | 7313 kB
 whatsdeployed            |          7414276 | 7241 kB
 socorro_test             |          7315972 | 7145 kB
 template1                |          7307780 | 7137 kB
 template0                |          7143940 | 6977 kB
(26 rows)
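If you'd rather run that same query from Python instead of psql, a minimal sketch (assuming psycopg2 is installed and you can connect to the postgres database) could look like this:

import psycopg2

connection = psycopg2.connect(dbname='postgres')
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT pg_database.datname,
               pg_database_size(pg_database.datname),
               pg_size_pretty(pg_database_size(pg_database.datname))
        FROM pg_database ORDER BY 2 DESC
    """)
    for datname, size, pretty in cursor.fetchall():
        print('{:<30} {:>15} {:>10}'.format(datname, size, pretty))
connection.close()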
Conditional aggregation in Django 2.0
January 12, 2018
4 comments Python, Django, PostgreSQL
Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation", which is a SQL standard feature I didn't even know about.
Before
So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various different time intervals.
First of all, this is how it sets up the time intervals:
today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)
And then, for each of these, there's a little function that returns a dict for each time interval:
def count_and_size(qs, start, end):
sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
return {
'count': sub_qs.count(),
'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
}
numbers['uploads'] = {
'today': count_and_size(upload_qs, start_today, today),
'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
'this_month': count_and_size(upload_qs, start_this_month, today),
'this_year': count_and_size(upload_qs, start_this_year, today),
}
What you get is exactly 2 x 4 = 8 queries. One COUNT
and one SUM
for each time interval. E.g.
SELECT SUM("upload_upload"."size") AS "size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... SELECT COUNT(*) AS "__count" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...6 more queries...
Middle
Oops. I think this code comes from a slightly rushed job. We can do the COUNT
and the SUM
at the same time for each query.
# New, improved count_and_size() function!
def count_and_size(qs, start, end):
sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
return sub_qs.aggregate(
count=Count('id'),
total_size=Sum('size'),
)
numbers['uploads'] = {
'today': count_and_size(upload_qs, start_today, today),
'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
'this_month': count_and_size(upload_qs, start_this_month, today),
'this_year': count_and_size(upload_qs, start_this_year, today),
}
Much better, now there's only one query per time bucket. So 4 queries in total. E.g.
SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...3 more queries...
After
But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there are so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:
from django.db.models import Count, Q, Sum

def make_q(start, end):
return Q(created_at__gte=start, created_at__lt=end)
q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)
aggregates = upload_qs.aggregate(
today_count=Count('pk', filter=q_today),
today_total_size=Sum('size', filter=q_today),
yesterday_count=Count('pk', filter=q_yesterday),
yesterday_total_size=Sum('size', filter=q_yesterday),
this_month_count=Count('pk', filter=q_this_month),
this_month_total_size=Sum('size', filter=q_this_month),
this_year_count=Count('pk', filter=q_this_year),
this_year_total_size=Sum('size', filter=q_this_year),
)
numbers['uploads'] = {
'today': {
'count': aggregates['today_count'],
'total_size': aggregates['today_total_size'],
},
'yesterday': {
'count': aggregates['yesterday_count'],
'total_size': aggregates['yesterday_total_size'],
},
'this_month': {
'count': aggregates['this_month_count'],
'total_size': aggregates['this_month_total_size'],
},
'this_year': {
'count': aggregates['this_year_count'],
'total_size': aggregates['this_year_total_size'],
},
}
Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:
SELECT COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size", COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", ... FROM "upload_upload";
Is this the best thing to do? I'm starting to have my doubts.
Watch Out!
When I take this single monster query for a spin with an EXPLAIN ANALYZE prefix, I notice something worrying!
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)
A sequential scan! That's terrible. The created_at column is indexed with a BTREE, so why can't it use the index?
The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked.
So what did I do? I went back to the "middle" solution. One SELECT query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.
Synonyms with elasticsearch-dsl
December 5, 2017
0 comments Python, Web development, PostgreSQL
The documentation about how to use synonyms in Elasticsearch is good but because it's such an advanced topic, even if you read the documentation carefully, you're still left with lots of questions. Let me show you some things I've learned about how to use synonyms in Python with elasticsearch-dsl.
What's the nature of your documents?
I'm originally from Sweden but moved to London, UK in 1999 and started blogging a few years after. So I wrote most of my English with British English spelling. E.g. "centre" instead of "center". Later I moved to California in the US and slowly started to change my own English over to American English. I kept blogging but now I would prefer to write "center" instead of "centre".
Another example... Certain technical words or names are tricky. For example, is it "go" or is it "golang"? Is it "React" or is it "ReactJS"? Is it "PostgreSQL" or "Postgres"? I never know. Not only is it sometimes hard to know which is right because people use them differently, but "brands" like these also change over time since their inception: the creator might have preferred one thing but the masses call it something else.
So with all that in mind, not only has the nature of my documents (my blog post texts) changed in terminology over the years. My visitors are also coming both from British English and American English. Or, suppose that I knew the perfect way to phrase that relational database that starts with "Postg...". Even if my text is always spelled one particular way, perfectly, my visitors will most likely refer to it as "postgres" sometimes and "postgresql" sometimes.
The simple solution, match all!
Create a custom analyzer
Let's jump straight into the code. People who have used elasticsearch_dsl
should be familiar with most of this:
from elasticsearch_dsl import (
DocType,
Text,
Index,
analyzer,
Keyword,
token_filter,
)
from django.conf import settings
index = Index(settings.ES_INDEX)
index.settings(**settings.ES_INDEX_SETTINGS)
synonym_tokenfilter = token_filter(
'synonym_tokenfilter',
'synonym',
synonyms=[
'reactjs, react', # <-- important
],
)
text_analyzer = analyzer(
'text_analyzer',
tokenizer='standard',
filter=[
# The ORDER is important here.
'standard',
'lowercase',
'stop',
synonym_tokenfilter,
# Note! 'snowball' comes after 'synonym_tokenfilter'
'snowball',
],
char_filter=['html_strip']
)
class BlogItemDoc(DocType):
oid = Keyword(required=True)
title = Text(
required=True,
analyzer=text_analyzer
)
text = Text(analyzer=text_analyzer)
index.doc_type(BlogItemDoc)
The code above is copied from the "real code", but a lot of distracting things that aren't important to the point have been removed.
The magic sauce here is that you create a token_filter
and you can call it whatever you want. I called mine synonym_tokenfilter
and that's also what the instance variable is called.
Notice the list of synonyms. It's a plain list of strings. Specifically, it's a list containing 1 string: reactjs, react.
Let's see how Elasticsearch analyzes this:
First with the text react
.
$ curl -XGET 'http://127.0.0.1:9200/peterbecom/_analyze?analyzer=text_analyzer&text=react&pretty=1'
{
  "tokens" : [
    {
      "token" : "react",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "reactj",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
Note that the snowball filter converted reactjs to reactj, which is wrong in a sense because "reactjs" isn't a plural, but it ultimately doesn't matter much. At least not in this particular case.
Secondly, analyze it with the text reactjs
:
$ curl -XGET 'http://127.0.0.1:9200/peterbecom/_analyze?analyzer=text_analyzer&text=reactjs&pretty=1'
{
  "tokens" : [
    {
      "token" : "reactj",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "react",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
Same tokens! Just different order.
Test it for reals
Now, the real proof is in actually doing a search on this. Look at these two screenshots:
It worked! Different ways of phrasing your search, but ultimately it found all the matching documents, independent of how different people or different authors might prefer to spell it.
Try it for yourself:
- https://www.peterbe.com/search?q=reactjs&debug-search=1
- https://www.peterbe.com/search?q=react&debug-search=1
- https://www.peterbe.com/search?q=postgres&debug-search=1
- https://www.peterbe.com/search?q=postgresql&debug-search=1
What it looked like before
Check out these two screenshots of how it looked before, when synonyms for postgres and postgresql had not been set up yet:
One immediate thought I have is what a mess I've been in blogging about that database. Clearly I struggled to pick one way to spell it consistently.
And here's what it would look like once that synonym has been set up:
"go" versus "golang"
Go is a programming language. That term, too, struggles with a name ambiguity. Granted, I rarely hear people say "golang", but it's definitely a written word that turns up a lot.
The problem with setting up a synonym for go == golang is that "go" is a common English word. It's also the stem of the word "going" and such. So if you set up a synonym, like I did for react and reactjs above, this is what happens:
These are now the exact same search results as if I had searched for go. But look what it matched! It matched "Go" (good) but also "Going real simple..." (bad) and "...I should go" (bad).
If someone searches for the simple term "go" they probably intend to search for the Go programming language. All that snowball stemming is critical for a bunch of other non-computer-term searches so we can't remove the stemming.
The solution is to use what's called "Simple Contraction". And it looks like this:
all_synonyms = [
'go => golang',
'react => reactjs',
'postgres => postgresql',
]
That basically means that a search for go
is a search for golang
. And a document that uses the word go
(alone) is indexed as golang
.
What happens is that the word go
gets converted to golang
which doesn't get stemming converted down to any other forms.
However, this is no silver bullet. Any search for the term go
is ultimately a search for the word golang
and the regular English word go
. So the benefit of all of this was that we got rid of search results matching on going
and gone
.
What you have to decide...
The case for go
is similar to the case for react
. Both of these words are nouns but they're also verbs.
Should people find "reacting to events" when they search for "react"?
If so, use react, reactjs
in the synonyms list.
Should people only find documents related to the noun "React" when they search for "event handling in react"?
If so, use react => reactjs
in the synonyms list.
It's up to you and your documents and what your users tend to search for.
Bonus! For American vs British English
AVKO.org publishes a list of all British to American English synonyms. You can download the whole list here. Unfortunately I can't find a license for this file but the compiled synonyms file is part of this repo which is licensed under MIT.
I downloaded this list and keep it in the repo. Then, when setting up the analyzer and token filters, I load it in like this:
import os

synonyms_root = os.path.join(
settings.BASE_DIR, 'peterbecom/es-synonyms'
)
american_british_syns_fn = os.path.join(
synonyms_root, 'be-ae.synonyms'
)
with open(american_british_syns_fn) as f:
for line in f:
if (
'=>' not in line or
line.strip().startswith('#')
):
continue
all_synonyms.append(line.strip())
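That combined all_synonyms list then presumably gets passed into the same token_filter shown earlier (a sketch; the post doesn't show this step explicitly):

# Sketch: feed the combined contraction-style list into the token filter
synonym_tokenfilter = token_filter(
    'synonym_tokenfilter',
    'synonym',
    synonyms=all_synonyms,  # e.g. ['go => golang', 'react => reactjs', ...]
)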
Now I can finally enjoy not having to worry about the fact that sometimes I spell it "license" and sometimes I spell it "licence". It's all the same now. Brits and Americans, rejoice on common ground!
Bonus! For terrible spellers
Although I don't have a big problem with this on my techy blog, you can use the Simple Contraction technique to list unambiguously bad spellings. Add dont => don't to the list of synonyms and a search for dont becomes a search for don't.
Last but not least, the official Elasticsearch documentation is the place to go. This blog post hopefully phrases it in more approachable terms. Especially for Python peeps.
Simple or fancy UPSERT in PostgreSQL with Django
October 11, 2017
7 comments Python, Web development, Django, PostgreSQL
As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT
, but it's basically a way to execute an UPDATE
statement in case the INSERT
triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT
.
In this Django app I have a model that has a field called hash, which has a unique=True index on it. What I want to do is either insert a row or, if the hash is already in there, increment the count and update the modified_at timestamp instead.
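The post doesn't show the model, but based on the fields mentioned it might look roughly like this (a hypothetical reconstruction):

# models.py (hypothetical reconstruction; field types are guesses)
from django.db import models


class MissingSymbol(models.Model):
    hash = models.CharField(max_length=32, unique=True)
    symbol = models.CharField(max_length=150)
    debugid = models.CharField(max_length=150)
    filename = models.CharField(max_length=150)
    code_file = models.CharField(max_length=150, null=True)
    code_id = models.CharField(max_length=150, null=True)
    count = models.PositiveIntegerField(default=1)
    created_at = models.DateTimeField(auto_now_add=True)
    modified_at = models.DateTimeField(auto_now=True)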
The Code(s)
Here's the basic version in "pure Django ORM":
from django.db.models import F
from django.utils import timezone

if MissingSymbol.objects.filter(hash=hash_).exists():
MissingSymbol.objects.filter(hash=hash_).update(
count=F('count') + 1,
modified_at=timezone.now()
)
else:
MissingSymbol.objects.create(
hash=hash_,
symbol=symbol,
debugid=debugid,
filename=filename,
code_file=code_file or None,
code_id=code_id or None,
)
Here's that same code rewritten in "pure SQL":
from django.db import connection
with connection.cursor() as cursor:
cursor.execute("""
INSERT INTO download_missingsymbol (
hash, symbol, debugid, filename, code_file, code_id,
count, created_at, modified_at
) VALUES (
%s, %s, %s, %s, %s, %s,
1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
)
ON CONFLICT (hash)
DO UPDATE SET
count = download_missingsymbol.count + 1,
modified_at = CLOCK_TIMESTAMP()
WHERE download_missingsymbol.hash = %s
""", [
hash_, symbol, debugid, filename,
code_file or None, code_id or None,
hash_
]
)
Both work.
Note the use of CLOCK_TIMESTAMP() instead of NOW(). Since Django wraps all writes in transactions, if you use NOW() it will be evaluated to the same value for the whole transaction, which makes unit testing really hard.
But which is fastest?
The Results
First of all, this is hard to test locally because my Postgres is running locally in Docker, so there's barely any network latency. Against a remote Postgres, where the latency is higher, having to do two separate executions would cost even more.
I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
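The benchmark script itself isn't included in the post, but it presumably did something along these lines (a sketch; the helper names are made up):

import random
import statistics
import time


def upsert_with_orm():
    ...  # the "pure Django ORM" block above (hypothetical helper)


def upsert_with_sql():
    ...  # the raw SQL block above (hypothetical helper)


timings = {'SQL': [], 'ORM': []}
for _ in range(1000):
    method = 'SQL' if random.random() < 0.5 else 'ORM'
    t0 = time.time()
    (upsert_with_sql if method == 'SQL' else upsert_with_orm)()
    timings[method].append((time.time() - t0) * 1000)

for method, numbers in timings.items():
    print(method,
          'MEAN {:.2f}ms'.format(statistics.mean(numbers)),
          'MEDIAN {:.2f}ms'.format(statistics.median(numbers)))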
The results are:
METHOD    MEAN       MEDIAN
SQL        6.99ms     6.61ms
ORM       10.28ms     9.86ms
So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.
Discussion
There's an alternative and that's to use django-postgres-extra
but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.
Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:
"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."
I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.