Filtered by Python

Page 6

Reset

How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

September 28, 2019
65 comments Python, PostgreSQL, Redis

tl;dr; Redis is 16 times faster at reading these JSON blobs.

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:


class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:


CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.

UPDATE September 29, 2019

I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:


+from dataclasses import dataclass


+@dataclass
+class _Lookup:
+    modified: datetime.datetime
+    matches: list

...

+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)

print(lookup.modified)

Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.

uwsgi weirdness with --http

September 19, 2019
2 comments Python, Linux

Instead of upgrading everything on my server, I'm just starting from scratch. From Ubuntu 16.04 to Ubuntu 19.04 and I also upgraded everything else in sight. One of them was uwsgi. I copied various user config files but for uwsgi things didn't very well. On the old server I had uwsgi version 2.0.12-debian and on the new one 2.0.18-debian. The uWSGI changelog is pretty hard to read but I sure don't see any mention of this.

You see, on SongSearch I have it so that Nginx talks to Django via a uWSGI socket. But the NodeJS server talks to Django via 127.0.0.1:PORT. So I need my uWSGI config to start both. Here was the old config:

[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
http = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

(The only difference on the new server was the python37 plugin instead)

I start it and everything looks fine. No errors in the log files. And netstat looks like this:

# netstat -ntpl | grep 9090
tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      1855/uwsgi

But every time I try to curl localhost:9090 I kept getting curl: (52) Empty reply from server. Nothing in the log files! It seemed no matter what I tried I just couldn't talk to it over HTTP. No, I'm not a sysadmin. I'm just a hobbyist trying to stand up my little server with the tools and limited techniques I know but I was stumped.

The solution

After endless Googling for a resolution and trying all sorts of uwsgi commands directly, I somehow stumbled on the solution.


[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
-http = 127.0.0.1:9090
+http-socket = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

With this one subtle change, I can now curl localhost:9090 and I still have the /var/run/uwsgi/app/songsearch/socket socket. So, yay!

I'm blogging about this in case someone else ever gets stuck in the same nasty surprise as me.

Also, I have to admit, I was fuming with rage from this frustration. It's really inspired me to revive the quest for an alternative to uwsgi because I'm not sure it's that great anymore. There are new alternatives such as gunicorn, gunicorn with Meinheld, bjoern etc.

Fastest Python function to slugify a string

September 12, 2019
4 comments Python

In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:


    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but to react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates


translate_table = {ord(char): u'' for char in non_url_safe}
non_url_safe_regex = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


def _slugify1(self, text):
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    text = u'_'.join(text.split())
    return text

def _slugify2(self, text):
    text = text.translate(self.translate_table)
    text = u'_'.join(text.split())
    return text

def _slugify3(self, text):
    text = self.non_url_safe_regex.sub('', text).strip()
    text = u'_'.join(re.split(r'\s+', text))
    return text

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is 5 times faster. And a regex 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O such as databases and other genuinely CPU intense stuff. Well, it was fun little side-trip.

Also, aren't there better solutions that just blacklist all control characters?

Train your own spell corrector with TextBlob

August 23, 2019
0 comments Python

TextBlob is a wonderful Python library it. It wraps nltk with a really pleasant API. Out of the box, you get a spell-corrector. From the tutorial:


>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> str(b.correct())
'I have good spelling!'

The way it works is that, shipped with the library, is this text file: en-spelling.txt It's about 30,000 lines long and looks like this:

;;;   Based on several public domain books from Project Gutenberg
;;;   and frequency lists from Wiktionary and the British National Corpus.
;;;   http://norvig.com/big.txt
;;;   
a 21155
aah 1
aaron 5
ab 2
aback 3
abacus 1
abandon 32
abandoned 72
abandoning 27

That gave me an idea! How about I use the TextBlob API but bring my own text as the training model. It doesn't have to be all that complicated.

The challenge

(Note: All the code I used for this demo is available here: github.com/peterbe/spellthese)

I found this site that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple of out and mess with their spelling. Like, remove letters, add letters, and swap letters.

So, 5 random names now look like this:

▶ python challenge.py
RIGHT: jameson  TYPOED: jamesone
RIGHT: abel     TYPOED: aabel
RIGHT: wesley   TYPOED: welsey
RIGHT: thomas   TYPOED: thhomas
RIGHT: bryson   TYPOED: brysn

Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.

First, let's use the built in TextBlob.correct. A bit simplified but it looks like this:


from textblob import TextBlob


correct, typo = get_random_name()
b = TextBlob(typo)
result = str(b.correct())
right = correct == result
...

And the results:

▶ python test.py
ORIGIN         TYPO           RESULT         WORKED?
jesus          jess           less           Fail
austin         ausin          austin         Yes!
julian         juluian        julian         Yes!
carter         crarter        charter        Fail
emmett         emett          met            Fail
daniel         daiel          daniel         Yes!
luca           lua            la             Fail
anthony        anthonyh       anthony        Yes!
damian         daiman         cabman         Fail
kevin          keevin         keeping        Fail
Right 40.0% of the time

Buuh! Not very impressive. So what went wrong there? Well, the word met is much more common than emmett and the same goes for words like less, charter, keeping etc. You know, because English.

The solution

The solution is actually really simple. You just crack open the classes out of textblob like this:


from textblob import TextBlob
from textblob.en import Spelling

path = "spelling-model.txt"
spelling = Spelling(path=path)
# Here, 'names' is a list of all the 1,000 correctly spelled names.
# e.g. ['Liam', 'Noah', 'William', 'James', ...
spelling.train(" ".join(names), path)

Now, instead of corrected = str(TextBlob(typo).correct()) we do result = spelling.suggest(typo)[0][0] as demonstrated here:


correct, typo = get_random_name()
b = spelling.suggest(typo)
result = b[0][0]
right = correct == result
...

So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:

▶ python test.py
UNTRAINED...
ORIGIN         TYPO           RESULT         WORKED?
juan           jaun           juan           Yes!
ethan          etha           the            Fail
bryson         brysn          bryan          Fail
hudson         hudsn          hudson         Yes!
oliver         roliver        oliver         Yes!
ryan           rnyan          ran            Fail
cameron        caeron         carron         Fail
christopher    hristopher     christopher    Yes!
elias          leias          elias          Yes!
xavier         xvaier         xvaier         Fail
justin         justi          just           Fail
leo            lo             lo             Fail
adrian         adian          adrian         Yes!
jonah          ojnah          noah           Fail
calvin         cavlin         calvin         Yes!
jose           joe            joe            Fail
carter         arter          after          Fail
braxton        brxton         brixton        Fail
owen           wen            wen            Fail
thomas         thoms          thomas         Yes!
Right 40.0% of the time

TRAINED...
ORIGIN         TYPO           RESULT         WORKED?
landon         landlon        landon         Yes
sebastian      sebstian       sebastian      Yes
evan           ean            ian            Fail
isaac          isaca          isaac          Yes
matthew        matthtew       matthew        Yes
waylon         ywaylon        waylon         Yes
sebastian      sebastina      sebastian      Yes
adrian         darian         damian         Fail
david          dvaid          david          Yes
calvin         calivn         calvin         Yes
jose           ojse           jose           Yes
carlos         arlos          carlos         Yes
wyatt          wyatta         wyatt          Yes
joshua         jsohua         joshua         Yes
anthony        antohny        anthony        Yes
christian      chrisian       christian      Yes
tristan        tristain       tristan        Yes
theodore       therodore      theodore       Yes
christopher    christophr     christopher    Yes
joshua         oshua          joshua         Yes
Right 90.0% of the time

See, with very little effort you can got from 40% correct to 90% correct.

Note, that the output of something like spelling.suggest('darian') is actually a list like this: [('damian', 0.5), ('adrian', 0.5)] and you can use that in your application. For example:

<li><a href="?name=damian">Did you mean <b>damian</b></a></li>
<li><a href="?name=adrian">Did you mean <b>adrian</b></a></li>

Bonus and conclusion

Ultimately, what TextBlob does is a re-implementation of Peter Norvig's original implementation from 2007. I too, have written my own implementation in 2007. Depending on your needs, you can just figure out the licensing of that source code and lift it out and implement in your custom ways. But TextBlob wraps it up nicely for you.

When you use the textblob.en.Spelling class you have some choices. First, like I did in my demo:


path = "spelling-model.txt"
spelling = Spelling(path=path)
spelling.train(my_space_separated_text_blob, path)

What that does is creating a file spelling-model.txt that wasn't there before. It looks like this (in my demo):

▶ head spelling-model.txt
aaron 1
abel 1
adam 1
adrian 1
aiden 1
alexander 1
andrew 1
angel 1
anthony 1
asher 1

The number (on the right) there is the "frequency" of the word. But what if you have a "scoring" number of your own. Perhaps, in your application you just know that adrian is more right than damian. Then, you can make your own file:

Suppose the text file ("spelling-model-weighted.txt") contains lines like this:

...
adrian 8
damian 3
...

Now, the output becomes:

>>> import os
>>> from textblob.en import Spelling
>>> import os
>>> path = "spelling-model-weighted.txt"
>>> assert os.path.isfile(path)
>>> spelling = Spelling(path=path)
>>> spelling.suggest('darian')
[('adrian', 0.7272727272727273), ('damian', 0.2727272727272727)]

Based on the weighting, these numbers add up. I.e. 3 / (3 + 8) == 0.2727272727272727

I hope it inspires you to write your own spelling application using TextBlob.

For example, you can feed it the names of your products on an e-commerce site. The .txt file might bloat if you have too much but note that the 30K lines en-spelling.txt is only 314KB and it loads in...:

>>> from textblob import TextBlob
>>> from time import perf_counter
>>> b = TextBlob("I havv goood speling!")
>>> t0 = perf_counter(); right = b.correct() ; t1 = perf_counter()
>>> t1 - t0
0.07055813199999861

...70ms for 30,000 words.

SongSearch autocomplete rate now 2+ per second

July 11, 2019
0 comments Django, Python, Nginx, Redis

By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....

Roughly, 1.8 requests per second goes back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".

Autocomplete counts in Datadog

Caching strategy

Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.

What I used to do was something like this:


def search_autocomplete(request):
   q = request.GET.get('q') 

   cache_key = None
   if len(q) < 10:
      cache_key = 'autocomplete:' + q
      results = cache.get(cache_key)
      if results is not None:
          return http.JsonResponse(results)

   results = _do_elastisearch_query(q)
   if cache_key:
       cache.set(cache_key, results, 60 * 60)

   return http.JsonResponse(results)   

However, after some simple benchmarking it was clear that using Nginx' uwsgi_cache it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:


def search_autocomplete(request):
   q = request.GET.get('q') 
   results = _do_elastisearch_query(q)
   response = http.JsonResponse(results)   

   if len(q) < 10:
       patch_cache_control(response, public=True, max_age=60 * 60)

   return response

The only annoying thing about Nginx caching is that purging is hard unless you go for that Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is that fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...

> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...

Nginx Amplify

My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:

NGINX Requests/s

NGINX Memory Usage

NGINX CPU Usage %

Thoughts and conclusion

Caching is hard. But it's also fun because it ties directly into performance work.

In my business logic, I chose that autocomplete queries that are between 1 and 9 characters are cacheable. And I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes in RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB which isn't much.

Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python I was able to decide about the cache key myself. A plausible implementation is cache_key = q.lower().strip() for example. That means you can protect your Elasticsearch backend from having to do {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.

The ideal would be some user-friendly diagnostics tool that I can point somewhere, towards Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.

UPDATE (July 12 2019)

Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps it up) for queries that at max 9 characters long. I wanted to appreciate and understand how ratio of all queries are longer than 9 characters so I wrote a report and its output is this:

POINT: 7
Sum show 75646 32.2%
Sum rest 159321 67.8%

POINT: 8
Sum show 83702 35.6%
Sum rest 151265 64.4%

POINT: 9
Sum show 90870 38.7%
Sum rest 144097 61.3%

POINT: 10
Sum show 98384 41.9%
Sum rest 136583 58.1%

POINT: 11
Sum show 106093 45.2%
Sum rest 128874 54.8%

POINT: 12
Sum show 113905 48.5%
Sum rest 121062 51.5%

It means that (independent of time expiry) 38.7% of queries are 9 characters or less.

Build an XML sitemap of XML sitemaps

June 1, 2019
0 comments Django, Python

Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.

The blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files which spans about 1,000,000 URLs. The context here is Python and the getting of the data is from Django. Python is pretty key here but if you have something other than Django, you can squint and mentally replace that with your own data mapper.

Generate the sitemap .xml.gz file(s)

Here's the core of the work. A generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.


import gzip

from lxml import etree


outfile = "sitemap-{start}-{end}.xml"
batchsize = 40_000


def generate(self, qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None

    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    close(root, filename)
    yield filename

The most important lines in terms of lxml.etree and sitemaps are:


root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...         
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc

Another important thing is the note about using .values(). If you don't do that Django will create a model instance for every single row it returns of the iterator. That's expensive. See this blog post.

Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.

Generate the map of sitemaps

Making the map of maps doesn't need to be gzipped since it's going to be tiny.


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))

And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.

Bonus and Thoughts

The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging fruit fixes. I know, because I tried. I experimented with not using pretty_print=True and I experimented with not writing with gzip.open and instead gzipping the files on later. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is order of maginitude marginal in comparison to the cost of actually getting the data out of the database plus later writing to disk. I also experimenting with generating the gzip content with zopfli and it didn't make much of a difference.

I originally wrote this code years ago and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000 so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000 but perhaps not important.