How much faster is Redis at storing a blob of JSON compared to PostgreSQL?
September 28, 2019
65 comments Python, PostgreSQL, Redis
tl;dr; Redis is 16 times faster at reading these JSON blobs.
In Song Search, when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious, it's earning me lower double-digit dollars per month.) To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store the result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?
The other caveat is that you can't store these lookups locally for too long, since prices and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:

class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)
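That "older than a couple of hundred days" rule boils down to a tiny helper. A sketch, assuming a 200-day cutoff (the exact number isn't pinned down above, so that constant is a hypothetical stand-in):

```python
import datetime

# Hypothetical cutoff; the post only says "a couple of hundred days".
MAX_AGE_DAYS = 200

def is_stale(modified):
    """True if a stored Amazon lookup is too old to trust."""
    age = datetime.datetime.utcnow() - modified
    return age.days > MAX_AGE_DAYS
```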
At the moment this database table is 3GB on disk.
Then I thought, why not use Redis for this? Then I can use Redis's "natural" expiration by simply setting an expiry time when I store the result, and I don't have to worry about cleaning up old stuff at all.
The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}
The speed difference
Perhaps unrealistic, but I'm doing all this testing on my MacBook Pro. The connections to PostgreSQL (version 11.4) and Redis (3.2.1) are both over localhost.
Reads
The reads are the most important because, hopefully, they happen 10x more often than writes, as several people can benefit from previous saves.
I changed my code so that it would read from both databases and, if the result was found in both, write their times to a log file that I later summarize. Results are as follows:
PG:
    median: 8.66ms    mean: 11.18ms    stdev: 19.48ms

Redis:
    median: 0.53ms    mean: 0.84ms     stdev: 2.26ms

(310 measurements)
It means that, focusing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.
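For what it's worth, summarizing the logged timings needs nothing beyond the standard library's statistics module. A sketch (the numbers here are made up, not my actual measurements):

```python
import statistics

def summarize(times_ms):
    """Boil a list of millisecond timings down to median/mean/stdev."""
    return {
        "median": statistics.median(times_ms),
        "mean": statistics.mean(times_ms),
        "stdev": statistics.stdev(times_ms),
    }

# A handful of made-up read timings, in milliseconds
print(summarize([8.1, 8.66, 9.2, 11.3, 30.0]))
```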
Writes
The writes are less important, but due to the synchronous nature of my Django app, the unlucky user who triggers a lookup that I didn't have will have to wait for the write before the XHR request can complete. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:
PG:
    median: 8.59ms    mean: 8.58ms    stdev: 6.78ms

Redis:
    median: 0.44ms    mean: 0.49ms    stdev: 0.27ms

(137 measurements)
It means that, focusing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.
Conclusion and discussion
First of all, I'm still a PostgreSQL fanboy and have no intention of ceasing to be one. These times are made up of much more than just the individual databases. For example, the PostgreSQL numbers depend on the Django ORM code that generates the SQL, sends the query, and then turns the result into a model instance. I don't know what the proportions are between that and the actual bytes-from-PostgreSQL's-disk times. But I'm not sure I care either. The tooling around the database is mostly inevitable, and it's what matters to users.
Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.
I'm currently using Django's cache framework with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that, if I really have to, I can just flush the cache and, although detrimental to performance (temporarily), it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage, but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums up this whole blog post: of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how much RAM I can afford for this piece and whether it's worth it.
UPDATE September 29, 2019
I experimented with an optimization: NOT turning the Django ORM query result into a model instance for each record. Instead, I did this:
+from dataclasses import dataclass
+@dataclass
+class _Lookup:
+ modified: datetime.datetime
+ matches: list
...
+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)
print(lookup.modified)
Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.
uwsgi weirdness with --http
September 19, 2019
2 comments Python, Linux
Instead of upgrading everything on my server, I'm just starting from scratch: from Ubuntu 16.04 to Ubuntu 19.04, and I also upgraded everything else in sight. One of those things was uwsgi. I copied various user config files over, but for uwsgi things didn't go very well. On the old server I had uwsgi version 2.0.12-debian and on the new one 2.0.18-debian. The uWSGI changelog is pretty hard to read, but I sure don't see any mention of this.
You see, on SongSearch I have it so that Nginx talks to Django via a uWSGI socket. But the Node.js server talks to Django via 127.0.0.1:PORT. So I need my uWSGI config to start both. Here was the old config:
[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
http = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8
(The only difference on the new server was the python37 plugin instead.)
I start it and everything looks fine. No errors in the log files. And netstat looks like this:

# netstat -ntpl | grep 9090
tcp        0      0 127.0.0.1:9090     0.0.0.0:*      LISTEN      1855/uwsgi
But every time I tried to curl localhost:9090 I kept getting curl: (52) Empty reply from server. Nothing in the log files! It seemed no matter what I tried, I just couldn't talk to it over HTTP. No, I'm not a sysadmin. I'm just a hobbyist trying to stand up my little server with the tools and limited techniques I know, but I was stumped.
The solution
After endless Googling for a resolution and trying all sorts of uwsgi commands directly, I somehow stumbled on the solution:
[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
-http = 127.0.0.1:9090
+http-socket = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8
With this one subtle change, I can now curl localhost:9090, and I still have the /var/run/uwsgi/app/songsearch/socket socket. So, yay!
I'm blogging about this in case someone else ever gets stuck in the same nasty surprise as me.
Also, I have to admit, I was fuming with rage from this frustration. It has really inspired me to revive the quest for an alternative to uwsgi, because I'm not sure it's that great anymore. There are newer alternatives such as gunicorn, gunicorn with Meinheld, bjoern, etc.
Fastest Python function to slugify a string
September 12, 2019
4 comments Python
In MDN's code base I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:
non_url_safe = ['"', '#', '$', '%', '&', '+',
                ',', '/', ':', ';', '=', '?',
                '@', '[', '\\', ']', '^', '`',
                '{', '|', '}', '~', "'"]

def slugify(self, text):
    """
    Turn the text content of a header into a slug for use in an ID
    """
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
    text = u'_'.join(text.split())
    return text
The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.
I couldn't help but react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.
The candidates
translate_table = {ord(char): u'' for char in non_url_safe}
non_url_safe_regex = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))

def _slugify1(self, text):
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    text = u'_'.join(text.split())
    return text

def _slugify2(self, text):
    text = text.translate(self.translate_table)
    text = u'_'.join(text.split())
    return text

def _slugify3(self, text):
    text = self.non_url_safe_regex.sub('', text).strip()
    text = u'_'.join(re.split(r'\s+', text))
    return text
I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.
The results
The slowest is fast enough. But if you're still reading, here are the results:
_slugify1  0.101ms
_slugify2  0.019ms
_slugify3  0.033ms
So using a translate table is 5 times faster, and a regex 3 times faster. But they're all sufficiently fast.
Conclusion
This is the least of your problems in a world of real I/O, such as databases and other genuinely CPU-intense stuff. Well, it was a fun little side-trip.
Also, aren't there better solutions that just blacklist all control characters?
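For the record, here's one way that alternative could look, sketched with the standard unicodedata module: it drops everything Unicode classifies as a control or format character instead of maintaining a hand-picked list. (Note it also removes newlines, so a real slugifier would normalize whitespace first; this is just an idea sketch, not a drop-in replacement.)

```python
import unicodedata

def strip_controls(text):
    # Unicode general categories starting with "C" cover control,
    # format and other invisible characters; drop them all instead
    # of blacklisting characters by hand.
    return ''.join(
        c for c in text if not unicodedata.category(c).startswith('C')
    )
```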
Train your own spell corrector with TextBlob
August 23, 2019
0 comments Python
TextBlob is a wonderful Python library. It wraps nltk with a really pleasant API. Out of the box, you get a spell corrector. From the tutorial:
>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> str(b.correct())
'I have good spelling!'
The way it works is that, shipped with the library, there's this text file: en-spelling.txt. It's about 30,000 lines long and looks like this:
;;; Based on several public domain books from Project Gutenberg
;;; and frequency lists from Wiktionary and the British National Corpus.
;;; http://norvig.com/big.txt
;;;
a 21155
aah 1
aaron 5
ab 2
aback 3
abacus 1
abandon 32
abandoned 72
abandoning 27
That gave me an idea! How about I use the TextBlob API but bring my own text as the training model? It doesn't have to be all that complicated.
The challenge
(Note: All the code I used for this demo is available here: github.com/peterbe/spellthese)
I found this site that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple out and mess with their spelling: remove letters, add letters, and swap letters.
So, 5 random names now look like this:
▶ python challenge.py
RIGHT: jameson   TYPOED: jamesone
RIGHT: abel      TYPOED: aabel
RIGHT: wesley    TYPOED: welsey
RIGHT: thomas    TYPOED: thhomas
RIGHT: bryson    TYPOED: brysn
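The challenge script itself isn't shown in the post, but a typo generator along those lines might look like this (entirely my own sketch):

```python
import random

def typo(name, rng=random):
    """Mess with a spelling: remove, add, or swap one letter."""
    i = rng.randrange(len(name))
    kind = rng.choice(["remove", "add", "swap"])
    if kind == "remove":
        return name[:i] + name[i + 1:]
    if kind == "add":
        return name[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + name[i:]
    # swap with the next letter (wrapping around for the last one)
    j = (i + 1) % len(name)
    letters = list(name)
    letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)
```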
Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.
First, let's use the built-in TextBlob.correct. A bit simplified, but it looks like this:
from textblob import TextBlob
correct, typo = get_random_name()
b = TextBlob(typo)
result = str(b.correct())
right = correct == result
...
And the results:
▶ python test.py
ORIGIN        TYPO          RESULT        WORKED?
jesus         jess          less          Fail
austin        ausin         austin        Yes!
julian        juluian       julian        Yes!
carter        crarter       charter       Fail
emmett        emett         met           Fail
daniel        daiel         daniel        Yes!
luca          lua           la            Fail
anthony       anthonyh      anthony       Yes!
damian        daiman        cabman        Fail
kevin         keevin        keeping       Fail
Right 40.0% of the time
Buuh! Not very impressive. So what went wrong there? Well, the word met is much more common than emmett, and the same goes for words like less, charter, keeping, etc. You know, because English.
The solution
The solution is actually really simple. You just crack open the classes out of textblob like this:
from textblob import TextBlob
from textblob.en import Spelling
path = "spelling-model.txt"
spelling = Spelling(path=path)
# Here, 'names' is a list of all the 1,000 correctly spelled names.
# e.g. ['Liam', 'Noah', 'William', 'James', ...
spelling.train(" ".join(names), path)
Now, instead of corrected = str(TextBlob(typo).correct()) we do result = spelling.suggest(typo)[0][0], as demonstrated here:
correct, typo = get_random_name()
b = spelling.suggest(typo)
result = b[0][0]
right = correct == result
...
So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:
▶ python test.py
UNTRAINED...
ORIGIN        TYPO          RESULT        WORKED?
juan          jaun          juan          Yes!
ethan         etha          the           Fail
bryson        brysn         bryan         Fail
hudson        hudsn         hudson        Yes!
oliver        roliver       oliver        Yes!
ryan          rnyan         ran           Fail
cameron       caeron        carron        Fail
christopher   hristopher    christopher   Yes!
elias         leias         elias         Yes!
xavier        xvaier        xvaier        Fail
justin        justi         just          Fail
leo           lo            lo            Fail
adrian        adian         adrian        Yes!
jonah         ojnah         noah          Fail
calvin        cavlin        calvin        Yes!
jose          joe           joe           Fail
carter        arter         after         Fail
braxton       brxton        brixton       Fail
owen          wen           wen           Fail
thomas        thoms         thomas        Yes!
Right 40.0% of the time

TRAINED...
ORIGIN        TYPO          RESULT        WORKED?
landon        landlon       landon        Yes
sebastian     sebstian      sebastian     Yes
evan          ean           ian           Fail
isaac         isaca         isaac         Yes
matthew       matthtew      matthew       Yes
waylon        ywaylon       waylon        Yes
sebastian     sebastina     sebastian     Yes
adrian        darian        damian        Fail
david         dvaid         david         Yes
calvin        calivn        calvin        Yes
jose          ojse          jose          Yes
carlos        arlos         carlos        Yes
wyatt         wyatta        wyatt         Yes
joshua        jsohua        joshua        Yes
anthony       antohny       anthony       Yes
christian     chrisian      christian     Yes
tristan       tristain      tristan       Yes
theodore      therodore     theodore      Yes
christopher   christophr    christopher   Yes
joshua        oshua         joshua        Yes
Right 90.0% of the time
See, with very little effort you can go from 40% correct to 90% correct.
Note that the output of something like spelling.suggest('darian') is actually a list like this: [('damian', 0.5), ('adrian', 0.5)], and you can use that in your application. For example:
<li><a href="?name=damian">Did you mean <b>damian</b></a></li>
<li><a href="?name=adrian">Did you mean <b>adrian</b></a></li>
Bonus and conclusion
Ultimately, what TextBlob does is a re-implementation of Peter Norvig's original implementation from 2007. I, too, wrote my own implementation in 2007. Depending on your needs, you can just figure out the licensing of that source code, lift it out, and implement it in your own custom ways. But TextBlob wraps it up nicely for you.
When you use the textblob.en.Spelling class, you have some choices. First, like I did in my demo:
path = "spelling-model.txt"
spelling = Spelling(path=path)
spelling.train(my_space_separated_text_blob, path)
What that does is create a file, spelling-model.txt, that wasn't there before. It looks like this (in my demo):
▶ head spelling-model.txt
aaron 1
abel 1
adam 1
adrian 1
aiden 1
alexander 1
andrew 1
angel 1
anthony 1
asher 1
The number on the right is the "frequency" of the word. But what if you have a "scoring" number of your own? Perhaps, in your application, you just know that adrian is more right than damian. Then, you can make your own file:
Suppose the text file ("spelling-model-weighted.txt") contains lines like this:
...
adrian 8
damian 3
...
Now, the output becomes:
>>> import os
>>> from textblob.en import Spelling
>>> path = "spelling-model-weighted.txt"
>>> assert os.path.isfile(path)
>>> spelling = Spelling(path=path)
>>> spelling.suggest('darian')
[('adrian', 0.7272727272727273), ('damian', 0.2727272727272727)]
Based on the weighting, these numbers add up. I.e. 3 / (3 + 8) == 0.2727272727272727
I hope this inspires you to write your own spelling application using TextBlob. For example, you could feed it the names of the products on an e-commerce site. The .txt file might bloat if you have too much, but note that the 30,000-line en-spelling.txt is only 314KB, and it loads in...:
>>> from textblob import TextBlob
>>> from time import perf_counter
>>> b = TextBlob("I havv goood speling!")
>>> t0 = perf_counter(); right = b.correct(); t1 = perf_counter()
>>> t1 - t0
0.07055813199999861
...70ms for 30,000 words.
SongSearch autocomplete rate now 2+ per second
July 11, 2019
0 comments Django, Python, Nginx, Redis
By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....
Roughly 1.8 requests per second go back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".
Caching strategy
Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.
What I used to do was something like this:
def search_autocomplete(request):
    q = request.GET.get('q')
    cache_key = None
    if len(q) < 10:
        cache_key = 'autocomplete:' + q
        results = cache.get(cache_key)
        if results is not None:
            return http.JsonResponse(results)
    results = _do_elasticsearch_query(q)
    if cache_key:
        cache.set(cache_key, results, 60 * 60)
    return http.JsonResponse(results)
However, after some simple benchmarking, it was clear that with Nginx's uwsgi_cache it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:
def search_autocomplete(request):
    q = request.GET.get('q')
    results = _do_elasticsearch_query(q)
    response = http.JsonResponse(results)
    if len(q) < 10:
        patch_cache_control(response, public=True, max_age=60 * 60)
    return response
The only annoying thing about Nginx caching is that purging is hard unless you go for Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is the fact that I can't really see what this means for my server. When I was caching with Redis, I could just use redis-cli and...:

> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...
Nginx Amplify
My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights into the state of things.
Thoughts and conclusion
Caching is hard. But it's also fun because it ties directly into performance work.
In my business logic, I chose that autocomplete queries between 1 and 9 characters are cacheable, and I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic, but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes of RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours; let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB, which isn't much.
Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python, I was able to decide on the cache key myself. A plausible implementation is cache_key = q.lower().strip(), for example. That means you can protect your Elasticsearch backend from having to handle both {"q": "A"} and {"q": "a"}. Who knows, perhaps there's a way to hack this in Nginx without compiling in some Lua engine.
The ideal would be some user-friendly diagnostics tool that I can point at Nginx and that tells me how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.
UPDATE (July 12 2019)
Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps them up) for queries that are at most 9 characters long. I wanted to understand what ratio of all queries are longer than 9 characters, so I wrote a report, and its output is this:
POINT: 7
Sum show  75646  32.2%
Sum rest 159321  67.8%

POINT: 8
Sum show  83702  35.6%
Sum rest 151265  64.4%

POINT: 9
Sum show  90870  38.7%
Sum rest 144097  61.3%

POINT: 10
Sum show  98384  41.9%
Sum rest 136583  58.1%

POINT: 11
Sum show 106093  45.2%
Sum rest 128874  54.8%

POINT: 12
Sum show 113905  48.5%
Sum rest 121062  51.5%
It means that (independent of time expiry) 38.7% of queries are 9 characters or less.
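The report script itself isn't included; the gist of it, assuming you've already extracted the length of each query from the access logs, could be sketched as:

```python
def share_at_most(query_lengths, points=range(7, 13)):
    """For each cutoff point, the share of queries that many characters or shorter."""
    total = len(query_lengths)
    return {
        point: sum(1 for length in query_lengths if length <= point) / total
        for point in points
    }

# e.g. with lengths [3, 8, 9, 10, 15], 60% of queries are 9 characters or less
```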
Build an XML sitemap of XML sitemaps
June 1, 2019
0 comments Django, Python
Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.
This blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files spanning about 1,000,000 URLs. The context here is Python, and the data comes from Django. Python is pretty key here, but if you have something other than Django, you can squint and mentally replace that with your own data mapper.
Generate the sitemap .xml.gz file(s)
Here's the core of the work: a generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.
import gzip

from lxml import etree

outfile = "sitemap-{start}-{end}.xml.gz"
batchsize = 40_000

def generate(self, qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None
    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    close(root, filename)
    yield filename
The most important lines in terms of lxml.etree and sitemaps are:
root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
Another important thing is the note about using .values(). If you don't do that, Django will create a model instance for every single row the iterator returns. That's expensive. See this blog post.

Another important thing is to use a Django ORM iterator, as that's much more efficient than messing around with limits and offsets.
Generate the map of sitemaps
Making the map of maps doesn't need to be gzipped since it's going to be tiny.
def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))
And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz), and that's good enough.
Bonus and Thoughts
The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging-fruit fixes left. I know, because I tried. I experimented with not using pretty_print=True, and I experimented with not writing with gzip.open and instead gzipping the files afterwards. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is marginal in comparison to the cost of actually getting the data out of the database plus writing to disk. I also experimented with generating the gzip content with zopfli, and it didn't make much of a difference.
I originally wrote this code years ago, and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000, so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000, but it's perhaps not important.