Simple or fancy UPSERT in PostgreSQL with Django

Wednesday, Oct 11, 2017
7 comments Python, Web development, Django, PostgreSQL

As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT, but it's basically a way to execute an UPDATE statement in case the INSERT triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT.

In this Django app I have a model that has a field called hash which has a unique=True index on it. What I want to do is either insert a row, or if the hash is already in there, it should increment the count and the modified_at timestamp instead.

The Code(s)

Here's the basic version in "pure Django ORM":


if MissingSymbol.objects.filter(hash=hash_).exists():
    MissingSymbol.objects.filter(hash=hash_).update(
        count=F('count') + 1,
        modified_at=timezone.now()
    )
else:
    MissingSymbol.objects.create(
        hash=hash_,
        symbol=symbol,
        debugid=debugid,
        filename=filename,
        code_file=code_file or None,
        code_id=code_id or None,
    )

Here's that same code rewritten in "pure SQL":


from django.db import connection


with connection.cursor() as cursor:
    cursor.execute("""
        INSERT INTO download_missingsymbol (
            hash, symbol, debugid, filename, code_file, code_id,
            count, created_at, modified_at
        ) VALUES (
            %s, %s, %s, %s, %s, %s,
            1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
          )
        ON CONFLICT (hash)
        DO UPDATE SET
            count = download_missingsymbol.count + 1,
            modified_at = CLOCK_TIMESTAMP()
        WHERE download_missingsymbol.hash = %s
        """, [
            hash_, symbol, debugid, filename,
            code_file or None, code_id or None,
            hash_
        ]
    )

Both work.

Note the use of CLOCK_TIMESTAMP() instead of NOW(). Since Django wraps all writes in transactions if you use NOW() it will be evaluated to the same value for the whole transaction, thus making unit testing really hard.

But which is fastest?

The Results

First of all, this hard to test locally because my Postgres is running locally in Docker so the network latency in talking to a network Postgres means that the latency is less and having to do two different executions would cost more if the network latency is more.

I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
The results are:

METHOD     MEAN       MEDIAN
SQL        6.99ms     6.61ms
ORM        10.28ms    9.86ms

So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.

Discussion

There's an alternative and that's to use django-postgres-extra but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.

Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:

"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."

I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.

Comments

Post your own comment

Haki benita October 12, 2017

Hey Peter,
a better approach for the orm would be to first update,

`updated = Model.objects.filter(...).update(..)`

And insert only of no rows were updated
`if updated == 0: # insert`

The comparison between the two approachs depends on wheather you have more updates or inserts.

Peter Bengtsson October 13, 2017

Worth a thought. I should re-run my little benchmark if I have near 100% updates done.

Peter Bengtsson October 12, 2017

But that would result in two queries.

Emmanuel Briot October 13, 2017

The Django version is definitely worse: it has race conditions. If there are two threads (two queries) going through the code simultaneously , they might both end up trying to insert the object, and one of them will get an error. The SQL version does not have the issue. You mentioned this in a single sentence in the discussion, but this is a major win for the SQL version.

The SQL version is bad, too, since it is subject to SQL injection. I am not sure what the proper way to do that with Django is, though.

Peter Bengtsson October 13, 2017

What SQL injection? There the arguments are always escaped. That's no different from how the Django ORM escapes arguments when you do something like `MyModel.objects.all().update(...)`

Emmanuel Briot October 13, 2017

You are right, sorry. I thought you were using string substitution ('%") to insert hash_. filename,... which would obviously be wrong. Your approach works.

Juan Margot August 9, 2019

Seconding this comment. The Django ORM version should basically never be used in any real world situation. Very susceptible to race conditions.

Previous:: "No space left on device" on OSX Docker October 3, 2017 Web development, Docker, macOS
Next:: Concurrent Gzip in Python October 13, 2017 Python, Linux, Docker

Related by category:: Native connection pooling in Django 5 with PostgreSQL June 25, 2025 Django, Python; A Python dict that can report which keys you did not use June 12, 2025 Python; How I run standalone Python in 2025 January 14, 2025 Python; How to resolve a git conflict in poetry.lock February 7, 2020 Python

Simple or fancy UPSERT in PostgreSQL with Django

The Code(s)

The Results

Discussion

Comments

Related posts