How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

28 September 2019   59 comments   Python, PostgreSQL, Redis

tl;dr; Redis is 16 times faster at reading these JSON blobs.

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:

class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.

UPDATE September 29, 2019

I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:

+from dataclasses import dataclass


+@dataclass
+class _Lookup:
+    modified: datetime.datetime
+    matches: list

...

+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)

print(lookup.modified)

Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.

Comments

Arakel

Did you prewarm the table in PG? Did you use indexes? Which type of indexes? Did you tune up your db?
You cant simply compare base configuration performance of in-memory db and PG. You will need some careful configuration and then most probably you'll get much better results.

Peter Bengtsson

The Redis isn't perfectly configured either. Maybe if PG was better configured, the difference would be 14x instead of 16x.

They aren't both apples. It's two different databases with their individual strengths and weaknesses. But you can use them for very similar applications, so it helps to be fully aware of their characteristics.

Łukasz Biały

Wouldn't it make sense to optimize caching in psql first (3gb db is something psql could keep in mem all the time easily) and then measure the ORM overhead + drop ORM on the hot path and replace it with plain sql DAO? You essentially *could* get a huge perf boost without increasing your data layer complexity I think.

Peter Bengtsson

The whole database is about 35GB. That one table, with the JSON blobs, is about 3GB.

But, I can think of some simple optimizations I could do, which is to use the ORM to make the SQL for me, but to not dress it up in and an ORM model instance.

Jason Sooter

Good insights! Thanks for sharing! A strategy I'll be trying in the future.

Serge

Even though Regis can be persistent, you can still lose data on crash. It does not flush data immediately on the disk on writes. Keep that in mind.

Peter Bengtsson

Definitely a feather in PG's hat. And much the reason why I'm nervous about *relying* on Redis as a persistent store. I'm more and more inclined to use Redis as a caching layer.

Kyle Harrison

I mean, the thing is that postgre is storing this data to disk, and reading and writing to disk. In other words a Persistent Datastore. Meaning data will survive a reboot of the service.
Redis is a pure memory storage, it reads and writes nothing to disk. In other words it's a Volatile Datadstore. Meaning data will not survive a reboot of the service.

They serve two entirely different purposes. Redis is purely a cache server. So it stands to reason that of course it's going to be insanely fast at what it does.

But hey, want something even faster? Check out KeyDB. It's a fork of redis and I believe drop in compatible. But it uses Multithreading to do the work, where Redis stubbornly stays single thread.

Peter Bengtsson

Redis is persistent. Perhaps you were thinking of Memcached.

Kyle Harrison

The default Snapshotting is hardly ideal for what one would call "persistent". One shouldn't be storing mission critical data in a cache server in hopes it'll survive. Data in stores like Redis and especially memcached should be considered and are volatile at all times. All Redis does differently from Memcached is ocassionally (and conditionally) dump whats in it into a single file for backup. Really, if you need your key to survive and don't care about performance as much, stick to your regular database solution, it will absolutely be safer there.

Tom Dyson

> Redis is a pure memory storage, it reads and writes nothing to disk. In other words it's a Volatile Datadstore. Meaning data will not survive a reboot of the service

This is straightforwardly wrong. See https://redis.io/topics/persistence.

> The default Snapshotting is hardly ideal for what one would call "persistent".

So... don't use the default snapshotting.

Kyle Harrison

> So... don't use the default snapshotting.

and you propose as to................ what? AOF? The thing that on the same page you linked too describes it as buggy and unreliable? Because that's the only other option for Redis here. Persistance is simply not Redis' strength, plain and simple.

It's a cache server. Treat it like one and everyone will be happy. Stop trying to use it like Mongo.

Neil Goldman

You said yourself the redis is not properly configured. It takes a decent amount of work to properly configure redis to be durable. Otherwise it is not.

Kyle Harrison

You know, I simply can't think of a scenario where I'd even want Redis to be "durable". It's a great server to spin up and immediately start storing serialized values into. Building into the application layer the reliance on refreshing that key when expired or missing.

For everything else I would care about if lost to a restart, I'd store in a normal database that properly respects ACID transactions.

Can someone spoon feed me some scenarios where having _redis_ persistence is actually a desireable thing? what's the point of sacrificing the speed (what it's good at) for AOF mode, especially if it's unreliable enough for Redis' docs to make note of anyways?

Peter Bengtsson

Yeah, it's rare if you use Redis generally as a caching pattern.

The one case would be if the Redis is flushed (corrupt restart, or FLUSHALL command) and it causes a stampeding herd on the backend that the cache is supposed to protect.
For example, a lot of web apps use something like Redis to store use session cookie values (e.g. https://docs.djangoproject.com/en/2.2/ref/settings/#std:setting-SESSION_ENGINE in Django) and losing the cache would sign everyone out which would suck. But even there, there are choices, such as the `cached_db` option in Django which *writes* to both, but then mostly *reads* from the cache.

Neil Goldman

I'm not saying whether or not you want to use Redis durably, just that if you're comparing Postgres vs Redis and not properly configuring Redis to be durable, it's not a very valid comparison. Likewise, I can't think of a situation where I'd prefer Postgres acting as an in-memory cache vs Redis.

Marco Ceppi

This simply isn't true. Redid can flush memory to disk and boot with that disk image. We use Redis and psql for persistent data stores without much issue. Writing to disk is an async process and typically doesn't disrupt performance unless it's a very large data set.

Anonymous

Kyle Harrison, the lax bro from John Hopkins?

Kyle Harrison

lol, sadly no. Wouldn't that be hilarious though? We've exchanged tweets though! Super chill guy

Aaron

Are the two databases both filled to the same size? There might be a scale difference if Postgres is looking through logs more data. And also the indexing that is done. Straight speed has tons of context.

Peter Bengtsson

No, they were not. PG had about 3GB worth of data. The Redis was only a couple of hundred megabytes.

Lookup by primary key is always indexed.

Anonymous

What JSON field type did you use in Postgres? JSON or JSONB?

Peter Bengtsson

I don't know what "JSON" is. But JSONB is the data type.

Ozz Nixon

Ridiculous Comparison!

REDIS is RAM based... PG or most every other SQL server is DISK based.

Peter Bengtsson

Not really. Redis is disk persistent too. Nothing is lost of your turn it off and restart the server.

Ozz Nixon

Redis transactions are not fully ACID compliant ( Atomicity, Consistency, Isolation, and Durability). If ACID transactions are expected, Redis is not the perfect fit and should not be used. An RDBMS or another database system should be used in those scenarios.
Feb 25, 2019

Just because REDIS can "Flush to disk" in the background - does not make it an RDBMS system. This technique is how they can say "Disk persistent"... but, if you add to the DB and pull the power - at the exact moment the data parser acknowledges the add - REDIS will lose the data period. This is now to say REDIS sucks, just that you are comparing a RAM DB vs a DISK DB.

Konstantin Gredeskoul

Seems not entirely appropriate to be comparing an in-memory only database (redis) to a proper transactional database, which with the default configuration requires to fscync and actually acknowledge the data is written to disk (you can change those settings btw).

They ate both fantastic tools, but just as I wouldn’t use PostgreSQL as a cache, I probably wouldn’t use redis as a transactional RDBMS.

Peter Bengtsson

fsync is only applicable when it comes to writes.

I'm not using PostgreSQL as a cache. It's being used in a pretty persistent way.

Morris de Oryx

An in-memory cache *should* be faster, and such tools absolutely have a place. A tool such as Redis can make an unusable system great, under the right conditions. It would be cool if Postgres added in-memory tables in V13 or V14.

In this case, unless the work is compounded in a loop, even the slowest times are *imperceptible* to a human being:

https://www.nngroup.com/articles/response-times-3-important-limits/

Put another way, some of your results are "20x faster" to a computer and imperceptibly different to a person.

Peter Bengtsson

The numbers add up and it's nice to eliminate slow things that add up when the application actually does a little bit more that just that one call.

Denique De Nique

The problem with your response is that human perception isn't the only factor. Another thing to consider is the cost at scale of grinding away for an extra 10x or 20x the time. A lot of infrastructure, particularly cloud infrastructure, is sensitive to this. So you can save a considerable amount of money by saving 20x the computation even if it isn't always happening in a loop.

Konstantin Gredeskoul

You can always add an in-memory file system and mount some postgresql tables on that partition if you wanted to test in memory speed of postgresql.

Or you could enable delayed_commit, and batch many transactions into a single fsync every 10 seconds.

Then the numbers will be closer to Redis.

Konstantin Gredeskoul

One additional point about relying on Redis for high performance writes: while postgresql supports concurrent writes via many simultaneous connections and resolves any conflicts that may arise, Redis is single-threaded and can only process a single command per server instance at a time. Therefore redis will peak at some high number of OPS and then fall over, blocking all operations.

Peter Bengtsson

Excellent point.
That's the kinda of thoughts that are important. Both databases can do the job. They have their different pros and cons. This particular blog post focussed a lot on just one of those: speed.

Joseph Locke

I feel like finishing a write before piping off a response to the user is an unfortunate way of handling things.

Matteo Pasquini

Ok, djago+whathever db just works.
Postresql is behind db,(ok, I'm a fan!!) there are features that django simply cannot manage.
Seeking for performances in postresql I'd use triggered partitioning, triggered retention policies and fine indexing with index space on dedicated ssd ... Well dba stuffs :)
About consistency, streaming replica, django that exploit switch on r/wand switchover failure.
About 'in memory tables' PG has, as many others, prepared transactions, I've found great benefits with tables over 1Tb (Pg 9.3, not partitioned)
...can Redis do that..? Gess it need Tbs of Ram..
Cheers.

Eli

does the redis blob still require a json parse afterwards?

Peter Bengtsson

The cryptic thing is that there actually isn't JSON on the Redis side. It's a dict. And my Redis serializer, in the Redis driver, is msgpack.

Anonymous

Dude, you should ask antirez, the creator of redis. Let me explain: when redis persistence is like postgresql, it has same speed or slower (writing).
http://oldblog.antirez.com/post/redis-persistence-demystified.html

Anonymous

Typo: "tl;dr; Redis is 16 times faster and reading these JSON blobs.*" ->
"tl;dr; Redis is 16 times faster at reading these JSON blobs.*"

Peter Bengtsson

Thanks! Fixed now.

anon

I'd be interested if you could try sqlite3 which gets rid of all the interprocess traffic since it is loaded into the python process.

Peter Bengtsson

Oh yes that would be neat. But does that mean that if you use sqlite3 that its entire memory is loaded into each uwsgi Python process (aka. worker)?

Anonymous

This seems like an apples and oranges comparison. You're using the JSONB type in postgres, which allows you to index and query specific fields in the blob, and your testing doesn't seem to be doing that at all. A fairer comparison would be to use postgres' text type, or a separate comparison that does query for specific fields with both.

Peter Bengtsson

Yes, it is apples and oranges but they are both things you can use for a juicy and healthy snack. You don't have to turn the apple into an orange (or the other way around) but it's about understanding the pros and cons of the apples and the pros and cons of the oranges. Then, equipped with that you can make informed decisions.

Anonymous

You didn't address what I said at all.

Anonymous

One other thing to keep in mind is that in most configurations I have seen the connection to postgres is encrypted, while the ones to Redis arn't. while most of the difference here is likely due to RAM vs DISK read/writes. some maybe due to the overhead of TLS connections.

tyler neely

This article is missing a ton of crucial details for a real storage decision. Redis has a particularly wasteful storage approach because it rewrites everything periodically. Postgres and many other systems are able to avoid moving old data as frequently.

What about persistence guarantees? What about replication requirements? What about backup effort? The list goes on and on. "Faster" means different things when you're talking about workloads that require low latency vs workloads that require high throughput.

A network-attached kv skips a lot of the work that an actual database performs. Maybe that's work that you really don't need, but maybe you'll learn after an outage that you wish you had it.

There is no substitute for measuring real workloads on real hardware.

I feel like this article is just trolling database authors, operators, and users, right?

Peter Bengtsson

Thanks for that mansplaining. I had no idea that there are more nuances to this.

Nicolas Grilly

You are probably establishing a new connection to PG for each request. Have you set CONN_MAX_AGE?

Peter Bengtsson

Yep, the connections persist.

René Dudfield

Good post (as usual), and interesting comments. I made a response post with some pg optimization tips if anyone is interested: https://renesd.blogspot.com/2019/10/using-postgresql-as-cache.html cheers!

Peter Bengtsson

You get it. I like you.

Dave

Did you try creating your PG database on /dev/shm (ramdisk) to eliminate the disk as a bottleneck? I'd be interested to see the difference in performance...

Peter Bengtsson

No, because the PG is used for a bunch of other data.

Vance

Personally I think Redis is the right solution for this since all you are doing is caching. You don't need all the data safety that postgres supplies. And that is what that performance difference is about. Redis is a wonderful tool for a great many things, but it will loose data even if you are syncing, plus look at the memory requirements of the sync to disk.

There are number of the factors here that make this an invalid comparious. You say you are caching json, but that's not what it looks like to me. It looks like MsgPack in Redis, a Table in postgres. Postgres has a JSON field, and you could just as easily use that with an indexed key. Or you cold us a blob field and serialize MsgPack into it then look it up by key like you do in redis. You'd get better performance.

Redis priority is performance. PostgreSQL priority is data integrity. Use the right tool for the job, but please don't compare apples to oranges and talk about x times multipliers.

pyros2097

I would suggest you use BoltDB best of both worlds.

Connor McDonald

Curious - what is the typical size of the JSON you are reading/storing?

Felix Buenemann

Have you considered just using an http cache like NGINX reverse proxy with proxy cache module?

This makes the most sense if you process the data on the client side, since you can skip the entire server side stack.

Your email will never ever be published

Related posts

Previous:
uwsgi weirdness with --http 19 September 2019
Next:
Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON 30 September 2019
Related by Keyword:
Build an XML sitemap of XML sitemaps 01 June 2019
Django ORM optimization story on selecting the least possible 22 February 2019
Fastest *local* cache backend possible for Django 04 August 2017
Fastest Redis configuration for Django 11 May 2017
Fastest cache backend possible for Django 07 April 2017