As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT
, but it's basically a way to execute an UPDATE
statement in case the INSERT
triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT
.
In this Django app I have a model that has a field called hash
which has a unique=True
index on it. What I want to do is either insert a row, or if the hash
is already in there, it should increment the count
and the modified_at
timestamp instead.
The Code(s)
Here's the basic version in "pure Django ORM":
if MissingSymbol.objects.filter(hash=hash_).exists():
MissingSymbol.objects.filter(hash=hash_).update(
count=F('count') + 1,
modified_at=timezone.now()
)
else:
MissingSymbol.objects.create(
hash=hash_,
symbol=symbol,
debugid=debugid,
filename=filename,
code_file=code_file or None,
code_id=code_id or None,
)
Here's that same code rewritten in "pure SQL":
from django.db import connection
with connection.cursor() as cursor:
cursor.execute("""
INSERT INTO download_missingsymbol (
hash, symbol, debugid, filename, code_file, code_id,
count, created_at, modified_at
) VALUES (
%s, %s, %s, %s, %s, %s,
1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
)
ON CONFLICT (hash)
DO UPDATE SET
count = download_missingsymbol.count + 1,
modified_at = CLOCK_TIMESTAMP()
WHERE download_missingsymbol.hash = %s
""", [
hash_, symbol, debugid, filename,
code_file or None, code_id or None,
hash_
]
)
Both work.
Note the use of CLOCK_TIMESTAMP()
instead of NOW()
. Since Django wraps all writes in transactions if you use NOW()
it will be evaluated to the same value for the whole transaction, thus making unit testing really hard.
But which is fastest?
The Results
First of all, this hard to test locally because my Postgres is running locally in Docker so the network latency in talking to a network Postgres means that the latency is less and having to do two different executions would cost more if the network latency is more.
I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
The results are:
METHOD MEAN MEDIAN SQL 6.99ms 6.61ms ORM 10.28ms 9.86ms
So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.
Discussion
There's an alternative and that's to use django-postgres-extra
but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.
Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:
"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."
I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.
Comments
Post your own commentHey Peter,
a better approach for the orm would be to first update,
`updated = Model.objects.filter(...).update(..)`
And insert only of no rows were updated
`if updated == 0: # insert`
The comparison between the two approachs depends on wheather you have more updates or inserts.
Worth a thought. I should re-run my little benchmark if I have near 100% updates done.
But that would result in two queries.
The Django version is definitely worse: it has race conditions. If there are two threads (two queries) going through the code simultaneously , they might both end up trying to insert the object, and one of them will get an error. The SQL version does not have the issue. You mentioned this in a single sentence in the discussion, but this is a major win for the SQL version.
The SQL version is bad, too, since it is subject to SQL injection. I am not sure what the proper way to do that with Django is, though.
What SQL injection? There the arguments are always escaped. That's no different from how the Django ORM escapes arguments when you do something like `MyModel.objects.all().update(...)`
You are right, sorry. I thought you were using string substitution ('%") to insert hash_. filename,... which would obviously be wrong. Your approach works.
Seconding this comment. The Django ORM version should basically never be used in any real world situation. Very susceptible to race conditions.