As of PostgreSQL 9.5 we have UPSERT support. Technically, it's ON CONFLICT, but it's basically a way to execute an UPDATE statement in case the INSERT triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use ON CONFLICT.

In this Django app I have a model that has a field called hash which has a unique=True index on it. What I want to do is either insert a row, or if the hash is already in there, it should increment the count and the modified_at timestamp instead.

The Code(s)

Here's the basic version in "pure Django ORM":


if MissingSymbol.objects.filter(hash=hash_).exists():
    MissingSymbol.objects.filter(hash=hash_).update(
        count=F('count') + 1,
        modified_at=timezone.now()
    )
else:
    MissingSymbol.objects.create(
        hash=hash_,
        symbol=symbol,
        debugid=debugid,
        filename=filename,
        code_file=code_file or None,
        code_id=code_id or None,
    )

Here's that same code rewritten in "pure SQL":


from django.db import connection


with connection.cursor() as cursor:
    cursor.execute("""
        INSERT INTO download_missingsymbol (
            hash, symbol, debugid, filename, code_file, code_id,
            count, created_at, modified_at
        ) VALUES (
            %s, %s, %s, %s, %s, %s,
            1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP()
          )
        ON CONFLICT (hash)
        DO UPDATE SET
            count = download_missingsymbol.count + 1,
            modified_at = CLOCK_TIMESTAMP()
        WHERE download_missingsymbol.hash = %s
        """, [
            hash_, symbol, debugid, filename,
            code_file or None, code_id or None,
            hash_
        ]
    )

Both work.

Note the use of CLOCK_TIMESTAMP() instead of NOW(). Since Django wraps all writes in transactions if you use NOW() it will be evaluated to the same value for the whole transaction, thus making unit testing really hard.

But which is fastest?

The Results

First of all, this hard to test locally because my Postgres is running locally in Docker so the network latency in talking to a network Postgres means that the latency is less and having to do two different executions would cost more if the network latency is more.

I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
The results are:

METHOD     MEAN       MEDIAN
SQL        6.99ms     6.61ms
ORM        10.28ms    9.86ms

So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.

Discussion

There's an alternative and that's to use django-postgres-extra but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.

Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:

"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."

I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.

Comments

Post your own comment
Haki benita

Hey Peter,
a better approach for the orm would be to first update,

`updated = Model.objects.filter(...).update(..)`

And insert only of no rows were updated
`if updated == 0: # insert`

The comparison between the two approachs depends on wheather you have more updates or inserts.

Peter Bengtsson

Worth a thought. I should re-run my little benchmark if I have near 100% updates done.

Peter Bengtsson

But that would result in two queries.

Emmanuel Briot

The Django version is definitely worse: it has race conditions. If there are two threads (two queries) going through the code simultaneously , they might both end up trying to insert the object, and one of them will get an error. The SQL version does not have the issue. You mentioned this in a single sentence in the discussion, but this is a major win for the SQL version.

The SQL version is bad, too, since it is subject to SQL injection. I am not sure what the proper way to do that with Django is, though.

Peter Bengtsson

What SQL injection? There the arguments are always escaped. That's no different from how the Django ORM escapes arguments when you do something like `MyModel.objects.all().update(...)`

Emmanuel Briot

You are right, sorry. I thought you were using string substitution ('%") to insert hash_. filename,... which would obviously be wrong. Your approach works.

Juan Margot

Seconding this comment. The Django ORM version should basically never be used in any real world situation. Very susceptible to race conditions.

Your email will never ever be published.

Related posts