Filtered by Python

SongSearch autocomplete rate now 2+ per second

July 11, 2019
0 comments Django, Python, Nginx, Redis

By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....

Roughly, 1.8 requests per second go back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".

Autocomplete counts in Datadog

Caching strategy

Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.

What I used to do was something like this:


from django import http
from django.core.cache import cache


def search_autocomplete(request):
    q = request.GET.get('q', '')

    cache_key = None
    if len(q) < 10:
        cache_key = 'autocomplete:' + q
        results = cache.get(cache_key)
        if results is not None:
            return http.JsonResponse(results)

    results = _do_elasticsearch_query(q)
    if cache_key:
        cache.set(cache_key, results, 60 * 60)

    return http.JsonResponse(results)

However, after some simple benchmarking it was clear that, with Nginx's uwsgi_cache, it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:


from django.utils.cache import patch_cache_control


def search_autocomplete(request):
    q = request.GET.get('q', '')
    results = _do_elasticsearch_query(q)
    response = http.JsonResponse(results)

    if len(q) < 10:
        patch_cache_control(response, public=True, max_age=60 * 60)

    return response

The only annoying thing about Nginx caching is that purging is hard unless you go for Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is the fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...

> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...

Nginx Amplify

My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:

NGINX Requests/s

NGINX Memory Usage

NGINX CPU Usage %

Thoughts and conclusion

Caching is hard. But it's also fun because it ties directly into performance work.

In my business logic, I decided that autocomplete queries between 1 and 9 characters are cacheable, and I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes of RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB which isn't much.
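
There's no neat equivalent of redis-cli INFO for the Nginx cache, but a few lines of Python can at least report how much disk the cache directory is using. A rough sketch, assuming the /var/cache/nginx-cache/ path mentioned above:


import os


def nginx_cache_size(path="/var/cache/nginx-cache"):
    # Walk the uwsgi_cache directory and add up the size of every cached file.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


print(f"{nginx_cache_size() / 1024 / 1024:.1f}MB")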

Another snag with uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python I could decide on the cache key myself; a plausible implementation is cache_key = q.lower().strip(), for example. That way you can protect your Elasticsearch backend from having to handle both {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.

The ideal would be some user-friendly diagnostics tool that I can point somewhere, towards Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.

UPDATE (July 12 2019)

Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps them up) for queries that are at most 9 characters long. I wanted to understand what ratio of all queries are longer than 9 characters, so I wrote a report and its output is this:

POINT: 7
Sum show 75646 32.2%
Sum rest 159321 67.8%

POINT: 8
Sum show 83702 35.6%
Sum rest 151265 64.4%

POINT: 9
Sum show 90870 38.7%
Sum rest 144097 61.3%

POINT: 10
Sum show 98384 41.9%
Sum rest 136583 58.1%

POINT: 11
Sum show 106093 45.2%
Sum rest 128874 54.8%

POINT: 12
Sum show 113905 48.5%
Sum rest 121062 51.5%

It means that (independent of time expiry) 38.7% of queries are 9 characters or less.
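
The report itself isn't much code. Assuming the query strings have already been extracted from the Nginx access logs into a list, it could look roughly like this (a sketch, not the exact script):


from collections import Counter


def length_report(queries, points=(7, 8, 9, 10, 11, 12)):
    # Count queries by length, then for each cut-off point print how many
    # queries are at most that long ("show") versus longer ("rest").
    by_length = Counter(len(q) for q in queries)
    total = sum(by_length.values())
    for point in points:
        show = sum(n for length, n in by_length.items() if length <= point)
        rest = total - show
        print("POINT:", point)
        print("Sum show", show, "{:.1f}%".format(100 * show / total))
        print("Sum rest", rest, "{:.1f}%".format(100 * rest / total))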

Build an XML sitemap of XML sitemaps

June 1, 2019
0 comments Django, Python

Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.

This blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files which span about 1,000,000 URLs. The context here is Python, and the data comes out of Django. Python is pretty key here, but if you have something other than Django, you can squint and mentally replace that with your own data mapper.

Generate the sitemap .xml.gz file(s)

Here's the core of the work. A generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.


import gzip

from lxml import etree


outfile = "sitemap-{start}-{end}.xml.gz"
batchsize = 40_000


def generate(qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None

    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    if filename:  # don't write an empty file if the queryset was empty
        close(root, filename)
        yield filename

The most important lines in terms of lxml.etree and sitemaps are:


root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...         
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc

Another important thing is the note about using .values(). If you don't do that, Django will create a model instance for every single row the iterator returns. That's expensive. See this blog post.

Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.
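
For context, this is roughly how the generator might be driven. The queryset, base URL, and batch size here are illustrative assumptions, not the site's exact code:


# Illustrative only: Song and make_song_url come from the post's context,
# the base URL and ordering are made up.
qs = Song.objects.all().order_by("id")
for filename in generate(qs, "https://example.com", "sitemap-{start}-{end}.xml.gz", 40_000):
    print("Wrote", filename)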

Generate the map of sitemaps

Making the map of maps doesn't need to be gzipped since it's going to be tiny.


import datetime
import os
from glob import glob


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))

And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.

Bonus and Thoughts

The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging fruit fixes. I know, because I tried. I experimented with not using pretty_print=True and I experimented with not writing with gzip.open and instead gzipping the files afterwards. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is marginal compared to the cost of actually getting the data out of the database and then writing to disk. I also experimented with generating the gzip content with zopfli and it didn't make much of a difference.

I originally wrote this code years ago and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000 so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000 but perhaps not important.

Generate a random IP address in Python

June 1, 2019
0 comments Django, Python

I have a commenting system where people can type in a comment and optionally their name and email if they like.
In production, where things are real, the IP addresses that get collected are all interestingly different. But when testing this manually on my laptop, since the server is running at http://localhost:8000, request.META.get('REMOTE_ADDR') always becomes 127.0.0.1. Boring! So I fake it. Like this:


import random
from ipaddress import IPv4Address


def _random_ip_address(seed):
    random.seed(seed)
    return str(IPv4Address(random.getrandbits(32)))


...
# Here's the code deep inside the POST handler, just before storing
# the form submission in the database.

if settings.DEBUG and metadata.get("REMOTE_ADDR") == "127.0.0.1":
    # Make up a random one!
    metadata["REMOTE_ADDR"] = _random_ip_address(
        str(form.cleaned_data["name"]) + str(form.cleaned_data["email"])
    )
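
A nice side effect of seeding with the name and email is that the same "person" always gets the same fake IP across repeated test submissions. A quick sanity check (with made-up values):


# Same seed in, same fake IP out.
assert _random_ip_address("peterpeter@example.com") == _random_ip_address("peterpeter@example.com")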

It's pretty rough but it works and makes me happy.

How I simulate a CDN with Nginx

May 15, 2019
1 comment Python, Nginx

Usually, a CDN is just a cache you put in front of a dynamic website. You set up the CDN to be the first server your clients get data from; the CDN quickly decides if it has a cached copy, and otherwise asks the origin server for a fresh one. So far so good, but if you really care about squeezing out that extra performance you need to worry about having a decent TTL, and as soon as you make the TTL more than a couple of minutes you need to think about cache invalidation. You also need to worry about preventing certain endpoints from ever getting caught in the CDN, which could be very bad.

For this site, www.peterbe.com, I'm using KeyCDN which I've blogged about here: "I think I might put my whole site behind a CDN" and here: "KeyCDN vs. DigitalOcean Nginx". KeyCDN has an API and a python client which I've contributed to.

The next problem is: how do you test all this stuff on your laptop? Unfortunately, you can't deploy a KeyCDN docker image or something like that that mimics how it works for real. So, to simulate a CDN locally on my laptop, I'm using Nginx. It's definitely pretty different but that's not the point. The point is that you want something that acts as a reverse proxy. You want to make sure that stuff that's supposed to be cached gets cached, stuff that's supposed to be purged gets purged, and things that are always supposed to be dynamic stay dynamic.

The Configuration

First I add peterbecom.local into /etc/hosts like this:

▶ cat /etc/hosts | grep peterbecom.local
127.0.0.1       peterbecom.local origin.peterbecom.local
::1             peterbecom.local origin.peterbecom.local

Next, I set up the Nginx config (running on port 80) and the configuration looks like this:

proxy_cache_path /tmp/nginxcache  levels=1:2    keys_zone=STATIC:10m
    inactive=24h  max_size=1g;

server {
    server_name peterbecom.local;
    location / {
        proxy_cache_bypass $http_secret_header;
        add_header X-Cache $upstream_cache_status;
        proxy_set_header x-forwarded-host $host;
        proxy_cache STATIC;
        # proxy_cache_key $uri;
        proxy_cache_valid 200  1h;
        proxy_pass http://origin.peterbecom.local;
    }
    access_log /tmp/peterbecom.access.log combined;
    error_log /tmp/peterbecom.error.log info;
}

By the way, I've also set up origin.peterbecom.local to be run in Nginx too but it could just be proxy_pass http://localhost:8000; to go straight to Django. Not relevant for this context.

The Purge

Without the commercial version of Nginx (Plus) you can't do easy purging just for purging's sake. But with proxy_cache_bypass $http_secret_header; it's very similar to purging, except that it immediately makes a request to the origin.

First, to test that it works, I start up Nginx and Django and now I can run:

▶ curl -v http://peterbecom.local/about > /dev/null
< HTTP/1.1 200 OK
< Server: nginx/1.15.10
< Cache-Control: public, max-age=3672
< X-Cache: MISS
...

(Note the X-Cache: MISS which comes from add_header X-Cache $upstream_cache_status;)

This should trigger a log line in /tmp/peterbecom.access.log and in the Django runserver foreground logs.

At this point, I can kill the Django server and run it again:

▶ curl -v http://peterbecom.local/about > /dev/null
< Server: nginx/1.15.10
< HTTP/1.1 200 OK
< Cache-Control: max-age=86400
< Cache-Control: public
< X-Cache: HIT
...

Cool! It's working without Django running. As expected. This is how to send a "purge request":

▶ curl -v -H "secret-header:true" http://peterbecom.local/about > /dev/null
> GET /about HTTP/1.1
> secret-header:true
>
< HTTP/1.1 502 Bad Gateway
...

Clearly, it's trying to go to the origin, which was killed, so you start that up again and you get back to:

▶ curl -v http://peterbecom.local/about > /dev/null
< HTTP/1.1 200 OK
< Server: nginx/1.15.10
< Cache-Control: public, max-age=3672
< X-Cache: MISS
...

In Python

In my site, there are Django signals that are triggered when a piece of content changes and I'm using python-keycdn-api in production but obviously, that won't work with Nginx. So I have a local setting and my Python code looks like this:



import requests
from django.conf import settings


# This function gets called by a Django `post_save` signal
# among other things such as cron jobs and management commands.
def purge_cdn_urls(urls):
    if settings.USE_NGINX_BYPASS:
        # Note! This Nginx trick will not just purge the proxy_cache, it will
        # immediately trigger a refetch.
        x_cache_headers = []
        for url in urls:
            if "://" not in url:
                url = settings.NGINX_BYPASS_BASEURL + url
            r = requests.get(url, headers={"secret-header": "true"})
            r.raise_for_status()
            x_cache_headers.append({"url": url, "x-cache": r.headers.get("x-cache")})
        print("PURGED:", x_cache_headers)
        return

    ...the stuff that uses keycdn...
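
For illustration, wiring a function like that up to a post_save signal might look roughly like this (the BlogItem model and the URL path are hypothetical placeholders, not my actual code):


from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import BlogItem  # hypothetical app and model


@receiver(post_save, sender=BlogItem)
def invalidate_cdn(sender, instance, **kwargs):
    # Purge (or, with the Nginx bypass trick, immediately refetch) the
    # URL(s) that render this object. The path is made up for this sketch.
    purge_cdn_urls(["/pages/{}".format(instance.slug)])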

Notes and Conclusion

One important feature is that my CDN is a CNAME for www.peterbe.com but it reaches the origin server on a different URL. When my Django code needs to know the outward-facing domain, I need to respect that. The communication between the CDN and my origin happens on a domain I don't want to expose. What KeyCDN does is send an x-forwarded-host header, which I need to take into account when figuring out which outward-facing absolute URL was used. Here's how I do that:


def get_base_url(request):
    base_url = ["http"]
    if request.is_secure():
        base_url.append("s")
    base_url.append("://")
    x_forwarded_host = request.headers.get("X-Forwarded-Host")
    if x_forwarded_host and x_forwarded_host in settings.ALLOWED_HOSTS:
        base_url.append(x_forwarded_host)
    else:
        base_url.append(request.get_host())
    return "".join(base_url)

That's about it. There are lots of other details I glossed over but the point is that this works well enough to test that the cache invalidation works as expected.

Whatsdeployed rewritten in React

April 15, 2019
0 comments Web development, Python, React, JavaScript

A couple of months ago my colleague Michael @mythmon Cooper wanted to add a feature to the front-end code of Whatsdeployed and learned that the whole front-end is spaghetti jQuery code. So, instead, he re-wrote it in React. My only requirements were "Use create-react-app and no redux", i.e. keep it simple.

We also took the opportunity to rewrite some of the ways that URLs are handled. It used to be that a "short link" would redirect. For example, GET /s-5HY would return a 302 to Location: ?org=mozilla&repo=tecken&name[]=Dev&url[]=https://symbols.dev.mozaws.net/__version__&name[]=Stage... Basically, the short link was just an alias for a redirect, just like those services like bit.ly or g.co. Now, the short link is a permanent fixture and it's included in the XHR calls to the server for getting the relevant data.

All old URLs will continue to work but now the canonical URL becomes /s/5HY/mozilla-services/tecken, for example. The :org/:repo isn't really necessary because the server knows exactly what 5HY (in this example) means, but it's nice for the URL bar's memory.

Another thing that changed was how it can recognize "bors commits". When you use bors, you put a bunch of commits into a GitHub Pull Request and then ask the bors bot to merge them into master. Using "bors mode" in Whatsdeployed is optional but we believe it looks a lot more user-friendly. Here is an example of mozilla/normandy with and without bors toggled on and off.

Without "bors mode"
Without "bors mode"

With "bors mode"
With "bors mode"

Thank you mythmon!

Lastly, hopefully this will make it a lot easier to contribute. Check out https://github.com/peterbe/whatsdeployed. All you need is Python 3, PostgreSQL, and almost any version of Node that can run create-react-app. Ping me if you find it hard to get up and running.

Django ORM optimization story on selecting the least possible

February 22, 2019
16 comments Web development, Django, Python, PostgreSQL

This is an optimization story that should not surprise anyone using the Django ORM. But I thought I'd share because I have numbers now! The origin of this came from a real requirement. For a given parent model, I'd like to extract the value of the name column of all its child models, and then turn all these name strings into one MD5 checksum string.

Variants

The first attempt looked like this:


artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

The SQL used to generate this is as follows:


SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name", 
"main_song"."text", "main_song"."language", "main_song"."key_phrases", 
"main_song"."popularity", "main_song"."text_length", "main_song"."metadata", 
"main_song"."created", "main_song"."modified", 
"main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Clearly, I don't need anything but just the name column, version 2:


artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Now, the SQL used is:


SELECT "main_song"."id", "main_song"."name" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

But still, since I don't really need instances of the model class Song, I can use the .values() method, which gives back a list of dictionaries. This is version 3:


names = []
for song in Song.objects.filter(artist=artist).values("name"):
    names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

This time Django figures out that it doesn't even need the primary key value, so the SQL looks like this:


SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Last but not least, there is an even faster one: values_list(). This time it doesn't even bother to map the column name to the value in a dictionary. And since I only need one column's value, I can set flat=True. Version 4 looks like this:


names = []
for name in Song.objects.filter(artist=artist).values_list("name", flat=True):
    names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Same SQL gets used this time as in version 3.

The benchmark

Hopefully this little benchmark script speaks for itself:


from songsearch.main.models import *

import hashlib


def f1(a):
    names = []
    for song in Song.objects.filter(artist=a):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f2(a):
    names = []
    for song in Song.objects.filter(artist=a).only("name"):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f3(a):
    names = []
    for song in Song.objects.filter(artist=a).values("name"):
        names.append(song["name"])
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f4(a):
    names = []
    for name in Song.objects.filter(artist=a).values_list("name", flat=True):
        names.append(name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


artist = Artist.objects.get(name="Bad Religion")
print(Song.objects.filter(artist=artist).count())

print(f1(artist) == f2(artist))
print(f2(artist) == f3(artist))
print(f3(artist) == f4(artist))

# Reporting
import time
import random
import statistics

functions = f1, f2, f3, f4
times = {f.__name__: [] for f in functions}

for i in range(500):
    func = random.choice(functions)
    t0 = time.time()
    func(artist)
    t1 = time.time()
    times[func.__name__].append((t1 - t0) * 1000)

for name in sorted(times):
    numbers = times[name]
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST", min(numbers))
    print("\tMEDIAN", statistics.median(numbers))
    print("\tMEAN  ", statistics.mean(numbers))
    print("\tSTDEV ", statistics.stdev(numbers))

I ran this on my PostgreSQL 11.1 on my MacBook Pro with Django 2.1.7. So the database is on localhost.

The results

276
True
True
True
FUNCTION: f1 Used 135 times
    BEST 6.309986114501953
    MEDIAN 7.531881332397461
    MEAN   7.834429211086697
    STDEV  2.03779968066591
FUNCTION: f2 Used 135 times
    BEST 3.039121627807617
    MEDIAN 3.7298202514648438
    MEAN   4.012803678159361
    STDEV  1.8498943539073027
FUNCTION: f3 Used 110 times
    BEST 0.9920597076416016
    MEDIAN 1.4405250549316406
    MEAN   1.5053835782137783
    STDEV  0.3523240470133114
FUNCTION: f4 Used 120 times
    BEST 0.9369850158691406
    MEDIAN 1.3251304626464844
    MEAN   1.4017681280771892
    STDEV  0.3391019435930447

Bar chart

Discussion

I guess the hashlib.md5("".join(names).encode("utf-8")).hexdigest() stuff is a bit "off-topic" but I checked and it's roughly 300 times faster than building up the names list.

It's clearly better to ask less of Python and PostgreSQL to get a better total time. No surprise there. What was interesting was the proportion of these differences. Memorize that and you'll be better equipped to decide if it's worth the hassle of not using the Django ORM in its most basic form.

Also, do take note that this is only relevant when dealing with many records. The slowest variant (f1) takes, on average, 7 milliseconds.

Summarizing the difference with percentages compared to the fastest variant:

  • f1 - 573% slower
  • f2 - 225% slower
  • f3 - 6% slower
  • f4 - 0% slower

UPDATE Feb 25 2019

James suggested, although it's a bit "missing the point", that it could be even faster if all the aggregation is pushed into the PostgreSQL server, so that the only thing that needs to transfer from PostgreSQL to Python is the final result.

By the way, the name column in this particular benchmark, when concatenated into one big string, is ~4KB. So, with variant f5 only 32 bytes need to transfer, which would make a bigger difference if the network latency were higher.
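
For the curious, pushing the concatenation and hashing into PostgreSQL from the Django ORM can be done with a custom aggregate. Roughly something like this, in the same context as the benchmark functions above (a sketch of the idea; the real f5 is in the gist linked below):


from django.db.models import Aggregate, CharField


class NamesHash(Aggregate):
    # Sketch: compute md5(STRING_AGG(name, '')) inside PostgreSQL so only
    # the 32-character hex digest travels back to Python.
    function = "STRING_AGG"
    template = "md5(%(function)s(%(expressions)s, ''))"


def f5(a):
    return Song.objects.filter(artist=a).aggregate(
        names_hash=NamesHash("name", output_field=CharField())
    )["names_hash"]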

Here's the whole script: https://gist.github.com/peterbe/b2b7ed95d422ab25a65639cb8412e75e

And the results:

276
True
True
True
False
False
FUNCTION: f1 Used 92 times
    BEST 5.928993225097656
    MEDIAN 7.311463356018066
    MEAN   7.594626882801885
    STDEV  2.2027017044658423
FUNCTION: f2 Used 75 times
    BEST 2.878904342651367
    MEDIAN 3.3979415893554688
    MEAN   3.4774907430013022
    STDEV  0.5120246550765524
FUNCTION: f3 Used 88 times
    BEST 0.9310245513916016
    MEDIAN 1.1944770812988281
    MEAN   1.3105544176968662
    STDEV  0.35922655625999383
FUNCTION: f4 Used 71 times
    BEST 0.7879734039306641
    MEDIAN 1.1661052703857422
    MEAN   1.2262606284987758
    STDEV  0.3561764250427344
FUNCTION: f5 Used 90 times
    BEST 0.7929801940917969
    MEDIAN 1.0334253311157227
    MEAN   1.1836051940917969
    STDEV  0.4001442703048186
FUNCTION: f6 Used 84 times
    BEST 0.80108642578125
    MEDIAN 1.1119842529296875
    MEAN   1.2281338373819988
    STDEV  0.37146893005516973

Result: f5 takes 0.793ms and (the previous "winner") f4 takes 0.788ms.

I'm not entirely sure why f5 isn't faster but I suspect it's because the dataset is too small for it all to matter.

Compare:

songsearch=# explain analyze SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..229.33 rows=56 width=16) (actual time=0.014..0.208 rows=276 loops=1)
   Index Cond: (artist_id = 22729)
 Planning Time: 0.113 ms
 Execution Time: 0.242 ms
(4 rows)

with...

songsearch=# explain analyze SELECT md5(STRING_AGG("main_song"."name", '')) AS "names_hash" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=229.47..229.48 rows=1 width=32) (actual time=0.278..0.278 rows=1 loops=1)
   ->  Index Scan using main_song_ca949605 on main_song  (cost=0.43..229.33 rows=56 width=16) (actual time=0.019..0.204 rows=276 loops=1)
         Index Cond: (artist_id = 22729)
 Planning Time: 0.115 ms
 Execution Time: 0.315 ms
(5 rows)

I ran these two SQL statements about 100 times each and recorded their best possible execution times:

1) The plain SELECT - 0.99ms
2) The STRING_AGG - 1.06ms

So that accounts for only ~0.1ms of difference! Which kinda matches the results seen above. All in all, I think the dataset is too small to demonstrate this technique. But, considering the chance that the complexity might not be linear with the performance benefit, it's still interesting.

Even though this tangent is a bit off-topic, it is often a great idea to push as much work into the database as you can, if applicable. Especially if it means you can transfer a lot less data in the end.

Format thousands in Python

February 1, 2019
9 comments Python

tl;dr; Use f"{number:,}" to format an integer with thousands separators as a string.

I keep forgetting and having to look this up every time. Hopefully by blogging about it, this time it'll stick in my memory. And hopefully in yours too :)

Suppose you have a number, like 1234567890 and you want to display it, here's how you do it:


>>> number = 1234567890
>>> f"{number:,}"
'1,234,567,890'

In the past, before Python 3.6, I've been using:


>>> number = 1234567890
>>> format(number, ",")
'1,234,567,890'

All of this and more detail can be found in PEP 378 -- Format Specifier for Thousands Separator. For example, you can do this beast too:


>>> number = 1234567890
>>> f"{number:020,.2f}"
'0,001,234,567,890.00'

which demonstrates (1) how to do zero-padding (to a total width of 20), (2) the thousands comma, and (3) rounding to 2 decimal places. All useful weapons to be able to draw from the top of your head.

UPDATE

Also, incredibly useful is the equivalent of somestring.ljust(10):


>>> mystr = "peter"
>>> f"{mystr:10}"
'peter     '
>>> f"{mystr:>10}"
'     peter'

hashin 0.14.5 and canonical pip hashes

January 31, 2019
0 comments Python

Prior to version 0.14.5, hashin would write down the hashes of PyPI packages in the order they appear in PyPI's JSON response. That means there's a slight chance that two distinct clients/computers/humans might actually get different output when they run hashin Django==2.1.5.

The pull request has a pretty hefty explanation as it demonstrates the fix.
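
Conceptually, making the output canonical boils down to writing the hashes in a stable order, for example by sorting them. A sketch of the idea (not hashin's actual code; release_files is a stand-in for the package's file list from PyPI's JSON API):


# Sort the sha256 digests so the output doesn't depend on the order
# PyPI happened to list the files in.
digests = sorted(file_info["digests"]["sha256"] for file_info in release_files)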

Do note that if the existing order of hashes in a requirements file is not in the "right" order, hashin won't correct it unless any of the hashes are different.

Thanks @SomberNight for patiently pushing for this.

variable_cache_control - Django view decorator to set max_age in runtime

January 22, 2019
0 comments Django, Python

tl;dr; If you use the django.views.decorators.cache.cache_control decorator, consider this one instead to change the max_age depending on the request.

I had/have a Django view function that looks something like this:


@cache_control(public=True, max_age=60 * 60)
def home(request, oc=None, page=1):
    ...

But I really needed that number, 60 * 60, to be different depending on the request parameters. For example, take that oc=None: if it's not None, I know the page's Cache-Control header can and should be different.

So I wrote this decorator:


import functools

from django.utils.cache import patch_cache_control


def variable_cache_control(**kwargs):
    """Same as django.views.decorators.cache.cache_control except this one will
    allow the `max_age` parameter be a callable.
    """

    def _cache_controller(viewfunc):
        @functools.wraps(viewfunc)
        def _cache_controlled(request, *args, **kw):
            response = viewfunc(request, *args, **kw)
            copied = kwargs
            if kwargs.get("max_age") and callable(kwargs["max_age"]):
                max_age = kwargs["max_age"](request, *args, **kw)
                # Can't re-use, have to create a shallow clone.
                copied = dict(kwargs, max_age=max_age)
            patch_cache_control(response, **copied)
            return response

        return _cache_controlled

    return _cache_controller

Now, I can do this instead:


def _best_max_age(req, oc=None, **kwargs):
    max_age = 60 * 60
    if oc:
        max_age *= 10
    return max_age

@variable_cache_control(public=True, max_age=_best_max_age)
def home(request, oc=None, page=1):
    ...

I hope it inspires.

Use vars() to send an argparse Namespace into a function in Python

January 8, 2019
1 comment Python

I only just learned about this today after all these years and thought you might like it too.

The trick is to conveniently turn an argparse.Namespace into keyword arguments that you can send to a function. This is the old/wrong way I've been doing it for years:


# THE OLD WAY

def main(things, option_a, option_n):
    print(locals())  # Debugging 


import argparse

parser = argparse.ArgumentParser()
parser.add_argument("things", help="Bla bla", nargs="*")
parser.add_argument("-o", "--option-a", help="Bla bla", default="Op A")
parser.add_argument("-n", "--option-n", help="Ble ble", default="Op N")
args = parser.parse_args()

main(
    things=args.things,
    option_a=args.option_a,
    option_n=args.option_n
)

That works, but the tedious thing is having to spell out every single argument, twice!, when sending the argparse Namespace into the function. Here's the much moar betterest way:


# THE NEW WAY

def main(things, option_a, option_n):
    print(locals())  # Debugging 


import argparse

parser = argparse.ArgumentParser()
parser.add_argument("things", help="Bla bla", nargs="*")
parser.add_argument("-o", "--option-a", help="Bla bla", default="Op A")
parser.add_argument("-n", "--option-n", help="Ble ble", default="Op N")
args = parser.parse_args()

# The only difference and the magic sauce...
main(**vars(args))  

What's neat about this is that you don't have to type out every argument defined in the parser just to get them into the function. And as a bonus, Python matches keyword arguments by name, so the order doesn't matter.

Caveat! This "trick" assumes that the arguments in the parser match the arguments in the function. So if the main() function takes an argument called foo_bar you have to have an argument in the parser called --foo-bar.