As of PostgreSQL 9.5 we have UPSERT support. Technically, it's
ON CONFLICT, but it's basically a way to execute an
UPDATE statement in case the
INSERT triggers a conflict on some column value. By the way, here's a great blog post that demonstrates how to use
In this Django app I have a model that has a field called
hash which has a
unique=True index on it. What I want to do is either insert a row, or if the
hash is already in there, it should increment the
count and the
modified_at timestamp instead.
Here's the basic version in "pure Django ORM":
if MissingSymbol.objects.filter(hash=hash_). exists(): MissingSymbol.objects.filter(hash=hash_).update( count=F('count') + 1, modified_at=timezone.now() ) else: MissingSymbol.objects.create( hash=hash_, symbol=symbol, debugid=debugid, filename=filename, code_file=code_file or None, code_id=code_id or None, )
Here's that same code rewritten in "pure SQL":
from django.db import connection with connection.cursor() as cursor: cursor.execute(""" INSERT INTO download_missingsymbol ( hash, symbol, debugid, filename, code_file, code_id, count, created_at, modified_at ) VALUES ( %s, %s, %s, %s, %s, %s, 1, CLOCK_TIMESTAMP(), CLOCK_TIMESTAMP() ) ON CONFLICT (hash) DO UPDATE SET count = download_missingsymbol.count + 1, modified_at = CLOCK_TIMESTAMP() WHERE download_missingsymbol.hash = %s """, [ hash_, symbol, debugid, filename, code_file or None, code_id or None, hash_ ] )
Note the use of
CLOCK_TIMESTAMP() instead of
NOW(). Since Django wraps all writes in transactions if you use
NOW() it will be evaluated to the same value for the whole transaction, thus making unit testing really hard.
But which is fastest?
First of all, this hard to test locally because my Postgres is running locally in Docker so the network latency in talking to a network Postgres means that the latency is less and having to do two different executions would cost more if the network latency is more.
I ran a simple benchmark where it randomly picked one of the two code blocks (above) depending on a 50% chance.
The results are:
METHOD MEAN MEDIAN SQL 6.99ms 6.61ms ORM 10.28ms 9.86ms
So doing it with a block of raw SQL instead is 1.5 times faster. But this difference would surely grow when the network latency is higher.
There's an alternative and that's to use
django-postgres-extra but I'm personally hesitant. The above little raw SQL hack is the only thing I need and adding more libraries makes far-future maintenance harder.
Beyond the time optimization of being able to send only 1 SQL instruction to PostgreSQL, the biggest benefit is avoiding concurrency race conditions. From the documentation:
"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — "UPDATE or INSERT"."
I'm going to keep this little hack. It's not beautiful but it works and saves time and gives me more comfort around race conditions.