A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Python. Clear filter

cache_memoize - a pretty decent cache decorator for Django

11 September 2017 1 comment   Django, Web development, Python

This is something that's grown up organically when working on Mozilla Symbol Server. It has served me very well and perhaps it's worth extracting into its own lib.


Basically, you are probably used to this in Django:

from django.core.cache import cache

def compute_something(user, special=False):
    cache_key = 'meatycomputation:{}:special={}'.format(, special)
    value = cache.get(cache_key)
    if value is None:
        value = _call_the_meat(, special)  # some really slow function
        cache.set(cache_key, value, 60 * 5)
    return value

Here's instead how you can do exactly the same with cache_memoize:

from wherever.decorators import cache_memoize

@cache_memoize(60 * 5)
def compute_something(user, special=False):
    return _call_the_meat(, special)  # some really slow function

Cache invalidation

If you ever need to do non-trivial caching you know it's important to be able to invalidate the cache. Usually, to be able to do that you need to involved in how the cache key was created.

Consider our two examples above, here's first the common thing to do:

def save_user(user):

    cache_key = 'meatycomputation:{}:special={}'.format(, False)
    # And when it was special=True
    cache_key = 'meatycomputation:{}:special={}'.format(, True)

This works but it involves repeating the code that generates the cache key. You could extract that into its own function of course.

Here's how you do it with the cache_memoize decorator:

def save_user(user):

    compute_something.invalidate(user, special=False)
    compute_something.invalidate(user, special=True)    

Other features

There are actually two ways to "invalidate" the cache. Calling the new myoriginalfunction.invalidate(...) function or passing a custom extra keyword argument called _refresh. For example: compute_something(user, _refresh=True).

You can pass callables that get called when the cache works in your favor or when it's a cache miss. For example:

def increment_hits(user, special=None):
    # use your imagination

def cache_miss(user, special=None):
    print("cache miss on {}".format(

    60 * 5,
def compute_something(user, special=False):
    return _call_the_meat(, special)  # some really slow function

Sometimes you just want to use the memoizer to make sure something only gets called "once" (or once per time interval). In that case it might be smart to not flood your cache backend with the value of the function output if there is one. For example:

@cache_memoize(60 * 60, store_result=False)  # idempotent guard
def calculate_and_update(user):
    # do something expensive here that is best to only do once per hour

Internally cache_memoize will basically try to convert every argument and keyword argument to a string with, kinda, str(). That might not always be appropriate because you might know that you have two distinct objects whose __str__ will yield the same result. For that you can use the args_rewrite parameter. For example:

def simplify_special_objects(obj):
    # use your imagination
    return obj.hostname 

@cache_memoize(60 * 5, args_rewrite=simplify_special_objects)
def compute_something(special_obj):
    return _call_the_meat(special_obj.hostname)

In conclusion

I've uploaded the code as a gist.

It's quite possible that there's already a perfectly good lib that does exactly this. If so, thanks for letting me know. If not, perhaps I ought to wrap this up and publish it on PyPI. Again, that's for letting me know.

Mozilla Symbol Server (aka. Tecken) load testing

06 September 2017 0 comments   Mozilla, Django, Web development, Python

(Thanks Miles Crabil not only for being an awesome Ops person but also for reviewing this blog post!)

My project over the summer, here at Mozilla, has been a project called Mozilla Symbol Server. It's a web service that uploads C++ symbol files, downloads C++ symbol files and symbolicates C++ crash stacktraces. It went into production last week which was fun but there's still lots of work to do on adding beyond-parity features and more optimizations.

What Is Mozilla Symbol Server?

The code name for this project is Tecken and it's written in Python (Django, Gunicorn) and uses PostgreSQL, Redis and Celery. The frontend is entirely static and developed (almost) as a separate project within. The frontend is written in React (using create-react-app and react-router). Everything is run as Docker containers. And if you ask me more details about how it's configured/deployed I'm afraid I have to defer to the awesome Mozilla CloudOps team.

One the challenges I faces developing Tecken is that symbol downloads need to be fast to handle high volumes of traffic. Today I did some load testing on our stage deployment and managed to start 14 concurrent clients that bombarded our staging server with realistic HTTPS GET queries based on log files. It's actually 7 + 1 + 4 + 2 concurrent clients. 7 of them from a m3.2xlarge EC2 node (8 vCPUs), 1 from a m3.large EC2 node (1 vCPU), 2 from two separate NYC based DigitalOcean personal servers and 2 clients here from my laptop on my home broadband. Basically, each loadtest script process got its own CPU.

Total req/s
It's hard to know how much more each client could push if it wasn't slowed down. Either way, the server managed to sustain about 330 requests per second. Our production baseline goal is to able to handle at least 40 requests per second.

After running for a while the caches started getting warm but about 1-5% of requests do have to make a boto3 roundtrip to an S3 bucket located on the other side of America in Oregon. There is also a ~5% penalty in that some requests trigger a write to a central Redis ElastiCache server. That's cheaper than the boto3 S3 call but still hefty latency costs to pay.

The ELB in our staging environment spreads the load between 2 c4.large (2 vCPUs, 3.75GB RAM) EC2 web heads. Each running with preloaded Gunicorn workers between Nginx and Django. Each web head has its own local memcached server to share memory between each worker but only local to the web head.

Is this a lot?

How long is a rope? Hard to tell. Tecken's performance is certainly more than enough and by the sheer fact that it was only just production deployed last week tells me we can probably find a lot of low-hanging fruit optimizations on the deployment side over time.

One way of answering that is to compare it with our lightest endpoint. One that involves absolutely no external resources. It's just pure Python in the form of ELB → Nginx → Gunicorn → Django. If I run hey from the same server I did the load testing I get a topline of 1,300 requests per second.

$ hey -n 10000 -c 10
  Total:    7.6604 secs
  Slowest:  0.0610 secs
  Fastest:  0.0018 secs
  Average:  0.0075 secs
  Requests/sec: 1305.4199

That basically means that all the extra "stuff" (memcache key prep, memcache key queries and possible other high latency network requests) it needs to do in the Django view takes up roughly 3x the time it takes the absolute minimal Django request-response rendering.

Also, if I use the same technique to bombard a single URL, but one that actually involves most code steps but is definitely able to not require any slow ElastiCache writes or boto3 S3 reads you I get 800 requests per second:

$ hey -n 10000 -c 10
  Total:    12.4160 secs
  Slowest:  0.0651 secs
  Fastest:  0.0024 secs
  Average:  0.0122 secs
  Requests/sec: 805.4150
  Total data:   300000 bytes
  Size/request: 30 bytes

Lesson learned

Max CPU Used
It's a recurring reminder that performance is almost all about latency. If not RAM or disk it's networking. See the graph of the "Max CPU Used" which basically shows that CPU of user, system and stolen ("CPU spent waiting for the hypervisor to service another virtual CPU") never sum totalling over 50%.

Fastest way to match a filename's extension in Python

31 August 2017 2 comments   Python

tl;dr; By a slim margin, the fastest way to check a filename matching a list of extensions is filename.endswith(extensions)

This turned out to be premature optimization. The context is that I want to check if a filename matches the file extension in a list of 6.

The list being ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. Meaning, it should return True for foo.sym or foo.dbg.gz. But it should return False for bar.exe or bar.gz.

I put together a litte benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:

def f1(filename):
    for each in extensions:
        if filename.endswith(each):
            return True
    return False

def f2(filename):
    return filename.endswith(extensions_tuple)

regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)

def f3(filename):
    return bool(regex.findall(filename))

def f4(filename):
    return bool(

The results are boring. But I guess that's a result too:

FUNCTION             MEDIAN               MEAN
f1 9543 times        0.0110ms             0.0116ms
f2 9523 times        0.0031ms             0.0034ms
f3 9560 times        0.0041ms             0.0045ms
f4 9509 times        0.0041ms             0.0043ms

For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times. So, it means it took on average 0.0116ms to run f1 10 times here on my laptop with Python 3.6.

More premature optimization

Upon looking into the data and thinking about this will be used. If I reorder the list of extensions so the most common one is first, second most common second etc. Then the performance improves a bit for f1 but slows down slightly for f3 and f4.


That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes less than 0.001ms to do one filename match.

Fastest *local* cache backend possible for Django

04 August 2017 7 comments   Django, Web development, Python

I did another couple of benchmarks of different cache backends in Django. This is an extension/update on Fastest cache backend possible for Django published a couple of months ago. This benchmarking isn't as elaborate as the last one. Fewer tests and fewer variables.

I have another app where I use a lot of caching. This web application will run its cache server on the same virtual machine. So no separation of cache server and web head(s). Just one Django server talking to localhost:11211 (memcached's default port) and localhost:6379 (Redis's default port).

Also in this benchmark, the keys were slightly smaller. To simulate my applications "realistic needs" I made the benchmark fall on roughly 80% cache hits and 20% cache misses. The cache keys were 1 to 3 characters long and the cache values lists of strings always 30 items long (e.g. len(['abc', 'def', 'cba', ... , 'cab']) == 30).

Also, in this benchmark I was too lazy to test all different parsers, serializers and compressors that django-redis supports. I only test python-memcached==1.58 versus django-redis==4.8.0 versus django-redis==4.8.0 && msgpack-python==0.4.8.

The results are quite "boring". There's basically not enough difference to matter.

Config Average Median Compared to fastest
memcache 4.51s 3.90s 100%
redis 5.41s 4.61s 84.7%
redis_msgpack 5.16s 4.40s 88.8%


As Hal pointed out in the comment, when you know the web server and the memcached server is on the same computer you should use UNIX sockets. They're "obviously" faster since the lack of HTTP overhead at the cost of it doesn't work over a network.

Because running memcached on a socket on OSX is a hassle I only have one benchmark. Note! This basically compares good old django.core.cache.backends.memcached.MemcachedCache with two different locations.

Config Average Median Compared to fastest 3.33s 3.34s 81.3%
unix:/tmp/memcached.sock 2.66s 2.71s 100%

But there's more! Another option is to use pylibmc which is a Python client written in C. By the way, my Python I use for these microbenchmarks is Python 3.5.

Unfortunately I'm too lazy/too busy to do a matrix comparison of pylibmc on TCP versus UNIX socket. Here are the comparison results of using python-memcached versus pylibmc:

Client Average Median Compared to fastest
python-memcached 3.52s 3.52s 62.9%
pylibmc 2.31s 2.22s 100%


Seems my luck someone else has done the matrix comparison of python-memcached vs pylibmc on TCP vs UNIX socket:

Find static files defined in django-pipeline but not found

25 July 2017 0 comments   Django, Python

If you're reading this you're probably familiar with how, in django-pipeline, you define bundles of static files to be combined and served. If you're not familiar with django-pipeline it's unlike this'll be of much help.

The Challenge (aka. the pitfall)

So you specify bundles by creating things in your something like this:

        'colors': {
            'source_filenames': (
            'output_filename': 'css/colors.css',
            'extra_context': {
                'media': 'screen,projection',
        'stats': {
            'source_filenames': (
            'output_filename': 'js/stats.js',

You do a bit more configuration and now, when you run ./ collectstatic --noinput Django and django-pipeline will gather up all static files from all Django apps installed, then start post processing then and doing things like concatenating them into one file and doing stuff like minification etc.

The problem is, if you look at the example snippet above, there's a typo. Instead of js/application.js it's accidentally js/aplication.js. Oh noes!!

What's sad is it that nobody will notice (running ./ collectstatic will exit with a 0). At least not unless you do some careful manual reviewing. Perhaps you will notice later, when you've pushed the site to prod, that the output file js/stats.js actually doesn't contain the code from js/application.js.

Or, you can automate it!

A Solution (aka. the hack)

I started this work this morning because the error actually happened to us. Thankfully not in production but our staging server produced a rendered HTML page with <link href="/static/css/report.min.cd784b4a5e2d.css" rel="stylesheet" type="text/css" /> which was an actual file but it was 0 bytes.

It wasn't that hard to figure out what the problem was because of the context of recent changes but it would have been nice to catch this during continuous integration.

So what we did was add an extra class to settings.STATICFILES_FINDERS called myproject.base.finders.LeftoverPipelineFinder. So now it looks like this:

# in

    'myproject.finders.LeftoverPipelineFinder',  # the new hotness!

And here's the class implementation:

from pipeline.finders import PipelineFinder

from django.conf import settings
from django.core.exceptions import ImproperlyConfigured

class LeftoverPipelineFinder(PipelineFinder):
    """This finder is expected to come AFTER 
    django.contrib.staticfiles.finders.FileSystemFinder and 
    django.contrib.staticfiles.finders.AppDirectoriesFinder in 
    If a path is looked for here it means it's trying to find a file
    that none of the regular staticfiles finders couldn't find.
    def find(self, path, all=False):
        # Before we raise an error, try to find out where,
        # in the bundles, this was defined. This will make it easier to correct
        # the mistake.
        for config_name in 'STYLESHEETS', 'JAVASCRIPT':
            config = settings.PIPELINE[config_name]
            for key in config:
                if path in config[key]['source_filenames']:
                    raise ImproperlyConfigured(
                        'Static file {!r} can not be found anywhere. Defined in '
        # If the file can't be found AND it's not in bundles, there's
        # got to be something else really wrong.
        raise NotImplementedError(path)

Now, if you have a typo or something in your bundles, you'll get a nice error about it as soon as you try to run collectstatic. For example:

▶ ./ collectstatic --noinput
Post-processed 'css/search.min.css' as 'css/search.min.css'
Post-processed 'css/base.min.css' as 'css/base.min.css'
Post-processed 'css/base-dynamic.min.css' as 'css/base-dynamic.min.css'
Post-processed 'js/google-analytics.min.js' as 'js/google-analytics.min.js'
Traceback (most recent call last):
django.core.exceptions.ImproperlyConfigured: Static file 'js/aplication.js' can not be found anywhere. Defined in PIPELINE['JAVASCRIPT']['stats']['source_filenames']

Final Thoughts

This was a morning hack. I'm still not entirely sure if this the best approach, but there was none better and the result is pretty good.

We run ./ collectstatic --noinput in our continous integration just before it runs ./ test. So if you make a Pull Request that has a typo in it will get caught.

Unfortunately, it won't find missing files if you use foo*.js or something like that. django-pipeline uses glob.glob to convert expressions like that into a list of actual files and that depends on the filesystem and all of that happens before the django.contrib.staticfiles.finders.find function is called.

If you have any better suggestions to solve this, please let me know.

How to do performance micro benchmarks in Python

24 June 2017 3 comments   Python

Suppose that you have a function and you wonder, "Can I make this faster?" Well, you might already have thought that and you might already have a theory. Or two. Or three. Your theory might be sound and likely to be right, but before you go anywhere with it you need to benchmark it first. Here are some tips and scaffolding for doing Python function benchmark comparisons.


  1. Internally, Python will warm up and it's likely that your function depends on other things such as databases or IO. So it's important that you don't test function1 first and then function2 immediately after because function2 might benefit from a warm up painfully paid for by function1. So mix up the order of them or cycle through them enough that they all pay for or gain from warm ups.

  2. Look at the median first. The mean (aka. average) is often tainted by spikes and these spikes of slow-down can be caused by your local Spotify client deciding to reindex itself or something some such. Sometimes those spikes matter. For example, garbage collection is inevitable and will have an effect that matters.

  3. Run your functions many times. So many times that the whole benchmark takes a while. Like tens of seconds or more. Also, if you run it significantly long it's likely that all candidates gets punished by the same environmental effects such as garbage collection or CPU being reassinged to something else intensive on your computer.

  4. Try to take your benchmark into different, and possibly more realistic environments. For example, don't rely on reading a file like /Users/peterbe/only/on/my/macbook when, likely, the end destination for your code is an Ubuntu server in AWS. Write your code so that it's easy to copy and paste around, like into a vi/jed editor in an ssh session somewhere.

  5. Sanity check each function before benchmarking them. No need for pytest or anything fancy but just make sure that you test them in some basic way. But the assertion testing is likely to add to the total execution time so don't do it when running the functions.

  6. Avoid "prints" inside the time measured code. A print() is I/O and an "external resource" that can become very unfair to compare CPU bound performance.

  7. Don't fear testing many different functions. If you have multiple ideas of doing a function differently, it's cheap to pile them on. But be careful how you "report" because if there are many different ways of doing something you might accidentally compare different fruit without noticing.

  8. Make sure your functions take at least one parameter. I'm no Python core developer or C hacker but I know there are "murks" within a compiler and interpreter that might do what a regular memoizer might done. Also, the performance difference can be reversed on tiny inputs compared to really large ones.

  9. Be humble with the fact that 0.01 milliseconds difference when doing 10,000 iterations is probably not worth writing a more complex and harder-to-debug function.

The Boilerplate

Let's demonstrate with an example:

# The functions to compare
import math

def f1(degrees):
    return math.cos(degrees)

def f2(degrees):
    e = 2.718281828459045
    return (
        (e**(degrees * 1j) + e**-(degrees * 1j)) / 2

# Assertions
assert f1(100) == f2(100) == 0.862318872287684
assert f1(1) == f2(1) == 0.5403023058681398

# Reporting
import time
import random
import statistics

functions = f1, f2
times = {f.__name__: [] for f in functions}

for i in range(100000):  # adjust accordingly so whole thing takes a few sec
    func = random.choice(functions)
    t0 = time.time()
    t1 = time.time()
    times[func.__name__].append((t1 - t0) * 1000)

for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', len(numbers), 'times')
    print('\tMEDIAN', statistics.median(numbers))
    print('\tMEAN  ', statistics.mean(numbers))
    print('\tSTDEV ', statistics.stdev(numbers))

Let's break that down a bit.

You run that and get something like this:

FUNCTION: f1 Used 49990 times
    MEDIAN 0.0
    MEAN   0.00045161219591330375
    STDEV  0.0011268475946446341
FUNCTION: f2 Used 50010 times
    MEDIAN 0.00095367431640625
    MEAN   0.0009188626294516487
    STDEV  0.000642871632138125

More Examples

The example above already broke one of the tenets in that these functions were simply too fast. Doing rather basic mathematics is just too fast to compare with such a trivial benchmark. Here are some other examples:

Remove duplicates from list without losing order

# The functions to compare

def f1(seq):
    checked = []
    for e in seq:
        if e not in checked:
    return checked

def f2(seq):
    checked = []
    seen = set()
    for e in seq:
        if e not in seen:
    return checked

def f3(seq):
    checked = []
    [checked.append(i) for i in seq if not checked.count(i)]
    return checked

def f4(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def f5(seq):
    def generator():
        seen = set()
        for x in seq:
            if x not in seen:
                yield x

    return list(generator())

# Assertion
import random

def _random_seq(length):
    seq = []
    for _ in range(length):
    return seq

L = list('abca')
assert f1(L) == f2(L) == f3(L) == f4(L) == f5(L) == list('abc')
L = _random_seq(10)
assert f1(L) == f2(L) == f3(L) == f4(L) == f5(L)

# Reporting
import time
import statistics

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}

for i in range(3000):
    seq = _random_seq(i)
    for _ in range(len(functions)):
        func = random.choice(functions)
        t0 = time.time()
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)

for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', len(numbers), 'times')
    print('\tMEDIAN', statistics.median(numbers))
    print('\tMEAN  ', statistics.mean(numbers))
    print('\tSTDEV ', statistics.stdev(numbers))


FUNCTION: f1 Used 3029 times
    MEDIAN 0.6871223449707031
    MEAN   0.6917867380307822
    STDEV  0.42611748137761174
FUNCTION: f2 Used 2912 times
    MEDIAN 0.054955482482910156
    MEAN   0.05610262627130026
    STDEV  0.03000829926668248
FUNCTION: f3 Used 2985 times
    MEDIAN 1.4472007751464844
    MEAN   1.4371055654145566
    STDEV  0.888658217522005
FUNCTION: f4 Used 2965 times
    MEDIAN 0.051975250244140625
    MEAN   0.05343245816673035
    STDEV  0.02957275548477728
FUNCTION: f5 Used 3109 times
    MEDIAN 0.05507469177246094
    MEAN   0.05678296204202234
    STDEV  0.031521596461048934


def f4(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

Fastest way to count the number of lines in a file

# The functions to compare
import codecs
import subprocess

def f1(filename):
    count = 0
    with, encoding='utf-8', errors='ignore') as f:
        for line in f:
            count += 1
    return count

def f2(filename):
    with, encoding='utf-8', errors='ignore') as f:
        return len(

def f3(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

# Assertion
filename = 'big.csv'
assert f1(filename) == f2(filename) == f3(filename) == 9999

# Reporting
import time
import statistics
import random

functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

filenames = '', 'hacker_news_data.txt', 'yarn.lock', 'big.csv'
for _ in range(200):
    for fn in filenames:
        for func in functions:
            t0 = time.time()
            t1 = time.time()
            times[func.__name__].append((t1 - t0) * 1000)

for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', len(numbers), 'times')
    print('\tMEDIAN', statistics.median(numbers))
    print('\tMEAN  ', statistics.mean(numbers))
    print('\tSTDEV ', statistics.stdev(numbers))


FUNCTION: f1 Used 800 times
    MEDIAN 5.852460861206055
    MEAN   25.403797328472137
    STDEV  37.09347378640582
FUNCTION: f2 Used 800 times
    MEDIAN 0.45299530029296875
    MEAN   2.4077045917510986
    STDEV  3.717931526478758
FUNCTION: f3 Used 800 times
    MEDIAN 2.8804540634155273
    MEAN   3.4988239407539368
    STDEV  1.3336427480808102


def f2(filename):
    with, encoding='utf-8', errors='ignore') as f:
        return len(


No conclusion really. Just wanted to point out that this is just a hint of a decent start when doing performance benchmarking of functions.

There is also the timeit built-in for "provides a simple way to time small bits of Python code" but it has the disadvantage that your functions are not allowed to be as complex. Also, it's harder to generate multiple different fixtures to feed your functions without that fixture generation effecting the times.

There's a lot of things that this boilerplate can improve such as sorting by winner, showing percentages comparisons against the fastests, ASCII graphs, memory allocation differences, etc. That's up to you.

Fastest way to find out if a file exists in S3 (with boto3)

16 June 2017 3 comments   Web development, Python

tl;dr; It's faster to list objects with prefix being the full key path, than to use HEAD to find out of a object is in an S3 bucket.


I have a piece of code that opens up a user uploaded .zip file and extracts its content. Then it uploads each file into an AWS S3 bucket if the file size is different or if the file didn't exist at all before.

It looks like this:

for filename, filesize, fileobj in extract(zip_file):
    size = _size_in_s3(bucket, filename)
    if size is None or size != filesize:
        upload_to_s3(bucket, filename, fileobj)
        print('Updated!' if size else 'New!')

I'm using the boto3 S3 client so there are two ways to ask if the object exists and get its metadata.

Option 1: client.head_object

Option 2: client.list_objects_v2 with Prefix=${keyname}.

But why the two different approaches?

The problem with client.head_object is that it's odd in how it works. Sane but odd. If the object does not exist, boto3 raises a botocore.exceptions.ClientError which contains a response and in it you can look for exception.response['Error']['Code'] == '404'.

What I noticed was that if you use a try:except ClientError: approach to figure out if an object exists, you reset the client's connection pool in urllib3. So after an exception has happened, any other operations on the client causes it to have to, internally, create a new HTTPS connection. That can cost time.

I wrote and filed this issue on

So I wrote two different functions to return an object's size if it exists:

def _key_existing_size__head(client, bucket, key):
    """return the key's size if it exist, else None"""
        obj = client.head_object(Bucket=bucket, Key=key)
        return obj['ContentLength']
    except ClientError as exc:
        if exc.response['Error']['Code'] != '404':

And the contender...:

def _key_existing_size__list(client, bucket, key):
    """return the key's size if it exist, else None"""
    response = client.list_objects_v2(
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']

They both work. That was easy to test. But which is fastest?

Before we begin, which do you think is fastest? The head_object feels like it'll be able to send an operation to S3 internally to do a key lookup directly. But S3 isn't a normal database.

Here's the script partially cleaned up but should be easy to run.

The results

So I wrote a loop that ran 1,000 times and I made sure the bucket was empty so that 1,000 times the result of the iteration is that it sees that the file doesn't exist and it has to do a client.put_object.

Here are the results:

FUNCTION: _key_existing_size__list Used 511 times
    SUM    148.2740752696991
    MEAN   0.2901645308604679
    MEDIAN 0.2569708824157715
    STDEV  0.17742598775696436

FUNCTION: _key_existing_size__head Used 489 times
    SUM    249.79622673988342
    MEAN   0.510830729529414
    MEDIAN 0.4780092239379883
    STDEV  0.14352671121877011

Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.

Clearly, using client.list_objects_v2 is faster. It's 90% faster than client.head_object.

But note! this was 1,000 times of B) "does the file already exist?" and B) "No? Ok upload it". So the times there include all the client.put_object calls.

So why did I measure both? I.e. _key_existing_size__list+client.put_object versus. _key_existing_size__head+client.put_object? The reason is that the approach of using try:except ClientError: followed by a client.put_object causes boto3 to create a new HTTPS connection in its pool. Again, see the issue which demonstrates this in different words.

What if the object always exists?

So, I simply run the benchmark again. The first time, it uploaded all 1,000 uniquely named objects. So running it a second time, every time the answer is that the object exists, and its size hasn't changed, so it never triggers the client.put_object.

Here are the results this time:

FUNCTION: _key_existing_size__list Used 495 times
    SUM    54.60546112060547
    MEAN   0.11031406286991004
    MEDIAN 0.08583354949951172
    STDEV  0.06339202669609442

FUNCTION: _key_existing_size__head Used 505 times
    SUM    44.59347581863403
    MEAN   0.0883039125121466
    MEDIAN 0.07310152053833008
    STDEV  0.054452842190700346

In this case, using client.head_object is faster. By 20% but the median time is 0.08 seconds! Even on a home broadband connection. In other words, I don't think that difference is significant.

One more time, excluding the client.put_object

The point of using client.list_objects_v2 instead of client.head_object was to avoid breaking the connection pool in urllib3 that boto3 manages somehow. Having to create a new HTTPS connection (and adding it to the pool) costs time, but what if we disregard that and compare the two functions "purely" on how long they take when the file does NOT exist? Remember, the second measurement above was when every object exists.

So we know it took 0.09 seconds and 0.07 seconds respectively for the two functions to figure out that the object does exist. How long does it take to figure out that the object does not exist independent of any other op. I.e. just try each one without doing a client.put_object afterwards. That means we avoid the bug so the comparison is fair.

The results:

FUNCTION: _key_existing_size__list Used 499 times
    SUM    123.57429671287537
    MEAN   0.247643881188127
    MEDIAN 0.2196049690246582
    STDEV  0.18622877427652743

FUNCTION: _key_existing_size__head Used 501 times
    SUM    112.99495434761047
    MEAN   0.22553883103315464
    MEDIAN 0.2828958034515381
    STDEV  0.15342842113446084

The client.list_objects_v2 beats client.head_object by 30%. And it matters. Above I said that 20% difference didn't matter but now it does. That's because the time difference when it always finds the object was 0.013 seconds. When it comes to figuring out that the object did not exist the time difference is 0.063 seconds. That's still a pretty small number but, hey, you gotto draw the line somewhere.

In conclusion

Using client.list_objects_v2 is a better alternative to using client.head_object.

If you think you'll often find that the object doesn't exist and needs a client.put_object then using client.list_objects_v2 is 90% faster. If you think you'll rarely need client.put_object (i.e. that most objects don't change) then client.list_objects_v2 is almost the same performance.

Fastest Redis configuration for Django

11 May 2017 0 comments   Django, Web development, Linux, Python

I have an app that does a lot of Redis queries. It all runs in AWS with ElastiCache Redis. Due to the nature of the app, it stores really large hash tables in Redis. The application then depends on querying Redis for these. The question is; What is the best configuration possible for the fastest service possible?

Note! Last month I wrote Fastest cache backend possible for Django which looked at comparing Redis against Memcache. Might be an interesting read too if you're not sold on Redis.


All options are variations on the compressor, serializer and parser which are things you can override in django-redis. All have an effect on the performance. Even compression, for if the number of bytes between Redis and the application is smaller, then it should have better network throughput.

Without further ado, here are the variations:

    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/0',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
    "json": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/1',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.json.JSONSerializer",
    "ujson": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/2',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "fastestcache.ujson_serializer.UJSONSerializer",
    "msgpack": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/3',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
    "hires": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/4',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "PARSER_CLASS": "redis.connection.HiredisParser",
    "zlib": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/5',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
    "lzma": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://') + '/6',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.lzma.LzmaCompressor"

As you can see, they each have a variation on the OPTIONS.PARSER_CLASS, OPTIONS.SERIALIZER or OPTIONS.COMPRESSOR.

The default configuration is to use redis-py and to pickle the Python objects to a bytestring. Pickling in Python is pretty fast but it has the disadvantage that it's Python specific so you can't have a Ruby application reading the same Redis database.

The Experiment

Note how I have one LOCATION per configuration. That's crucial for the sake of testing. That way one database is all JSON and another is all gzip etc.

What the benchmark does is that it measures how long it takes to READ a specific key (called benchmarking). Then, once it's done that it appends that time to the previous value (or [] if it was the first time). And lastly it writes that list back into the database. That way, towards the end you have 1 key whose value looks something like this: [0.013103008270263672, 0.003879070281982422, 0.009411096572875977, 0.0009970664978027344, 0.0002830028533935547, ..... MANY MORE ....].

Towards the end, each of these lists are pretty big. About 500 to 1,000 depending on the benchmark run.

In the experiment I used wrk to basically bombard the Django server on the URL /random (which makes a measurement with a random configuration). On the EC2 experiment node, it finalizes around 1,300 requests per second which is a decent number for an application that does a fair amount of writes.

The way I run the Django server is with uwsgi like this:

uwsgi --http :8000 --wsgi-file fastestcache/ --master --processes 4 --threads 2

And the wrk command like this:

wrk -d30s  ""

(that, by default, runs 2 threads on 10 connections)

At the end of starting the benchmarking, I open http://localhost:8000/summary which spits out a table and some simple charts.

An Important Quirk

Time measurements over time
One thing I noticed when I started was that the final numbers' average was very different from the medians. That would indicate that there are spikes. The graph on the right shows the times put into that huge Python list for the default configuration for the first 200 measurements. Note that there are little spikes but generally quite flat over time once it gets past the beginning.

Sure enough, it turns out that in almost all configurations, the time it takes to make the query in the beginning is almost order of magnitude slower than the times once the benchmark has started running for a while.

So in the test code you'll see that it chops off the first 10 times. Perhaps it should be more than 10. After all, if you don't like the spikes you can simply look at the median as the best source of conclusive truth.

The Code

The benchmarking code is here. Please be aware that this is quite rough. I'm sure there are many things that can be improved, but I'm not sure I'm going to keep this around.

The Equipment

The ElastiCache Redis I used was a cache.m3.xlarge (13 GiB, High network performance) with 0 shards and 1 node and no multi-zone enabled.

The EC2 node was a m4.xlarge Ubuntu 16.04 64-bit (4 vCPUs and 16 GiB RAM with High network performance).

Both the Redis and the EC2 were run in us-west-1c (North Virginia).

The Results

Here are the results! Sorry if it looks terrible on mobile devices.

root@ip-172-31-2-61:~# wrk -d30s  "" && curl ""
Running 30s test @
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.19ms    6.32ms  60.14ms   80.12%
    Req/Sec   583.94    205.60     1.34k    76.50%
  34902 requests in 30.03s, 2.59MB read
Requests/sec:   1162.12
Transfer/sec:     88.23KB
                         TIMES        AVERAGE         MEDIAN         STDDEV
json                      2629        2.596ms        2.159ms        1.969ms
msgpack                   3889        1.531ms        0.830ms        1.855ms
lzma                      1799        2.001ms        1.261ms        2.067ms
default                   3849        1.529ms        0.894ms        1.716ms
zlib                      3211        1.622ms        0.898ms        1.881ms
ujson                     3715        1.668ms        0.979ms        1.894ms
hires                     3791        1.531ms        0.879ms        1.800ms

Best Averages (shorter better)
██████████████████████████████████████████████████████████████   2.596  json
█████████████████████████████████████                            1.531  msgpack
████████████████████████████████████████████████                 2.001  lzma
█████████████████████████████████████                            1.529  default
███████████████████████████████████████                          1.622  zlib
████████████████████████████████████████                         1.668  ujson
█████████████████████████████████████                            1.531  hires
Best Medians (shorter better)
███████████████████████████████████████████████████████████████  2.159  json
████████████████████████                                         0.830  msgpack
████████████████████████████████████                             1.261  lzma
██████████████████████████                                       0.894  default
██████████████████████████                                       0.898  zlib
████████████████████████████                                     0.979  ujson
█████████████████████████                                        0.879  hires

Size of Data Saved (shorter better)
█████████████████████████████████████████████████████████████████  60K  json
██████████████████████████████████████                             35K  msgpack
████                                                                4K  lzma
█████████████████████████████████████                              35K  default
█████████                                                           9K  zlib
████████████████████████████████████████████████████               48K  ujson
█████████████████████████████████████                              34K  hires

Discussion Points


This experiment has lead me to the conclusion that the best serializer is msgpack and the best compression is zlib. That is the best configuration for django-redis.

msgpack has implementation libraries for many other programming languages. Right now that doesn't matter for my application but if msgpack is both faster and more versatile (because it supports multiple languages) I conclude that to be the best serializer instead.

Best practice with retries with requests

19 April 2017 3 comments   Python

tl;dr; I have a lot of code that does response = requests.get(...) in various Python projects. This is nice and simple but the problem is that networks are unreliable. So it's a good idea to wrap these network calls with retries. Here's one such implementation.

The First Hack

import time
import requests


def get(url):
        return requests.get(url)
    except Exception:
        # sleep for a bit in case that helps
        # try again
        return get(url)

This, above, is a terrible solution. It might fail for sooo many reasons. For example SSL errors due to missing Python libraries. Or the URL might have a typo in it, like get('http:/').

Also, perhaps it did work but the response is a 500 error from the server and you know that if you just tried again, the problem would go away.


while True:
    response = get('')
    if response.status_code != 500:
        # Hope it won't 500 a little later

What we need is a solution that does this right. Both for 500 errors and for various network errors.

The Solution

Here's what I propose:

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

def requests_retry_session(
    status_forcelist=(500, 502, 504),
    session = session or requests.Session()
    retry = Retry(
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

Usage example...

response = requests_retry_session().get('')

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

response = requests_retry_session(sesssion=s).get(

It's an opinionated solution but by its existence it demonstrates how it works so you can copy and modify it.

Testing The Solution

Suppose you try to connect to a URL that will definitely never work, like this:

t0 = time.time()
    response = requests_retry_session().get(
except Exception as x:
    print('It failed :(', x.__class__.__name__)
    print('It eventually worked', response.status_code)
    t1 = time.time()
    print('Took', t1 - t0, 'seconds')

There is no server running in :9999 here on localhost. So the outcome of this is...

It failed :( ConnectionError
Took 1.8215010166168213 seconds


1.8 = 0 + 0.6 + 1.2

The algorithm for that backoff is documented here and it says:

A backoff factor to apply between attempts after the second try (most errors are resolved immediately by a second try without a delay). urllib3 will sleep for: {backoff factor} * (2 ^ ({number of total retries} - 1)) seconds. If the backoff_factor is 0.1, then sleep() will sleep for [0.0s, 0.2s, 0.4s, ...] between retries. It will never be longer than Retry.BACKOFF_MAX. By default, backoff is disabled (set to 0).

It does 3 retry attempts, after the first failure, with a backoff sleep escalation of: 0.6s, 1.2s.
So if the server never responds at all, after a total of ~1.8 seconds it will raise an error:

In this example, the simulation is matching the expectations (1.82 seconds) because my laptop's DNS lookup is near instant for localhost. If it had to do a DNS lookup, it'd potentially be slightly more on the first failure.

Works In Conjunction With timeout

Timeout configuration is not something you set up in the session. It's done on a per-request basis. httpbin makes this easy to test. With a sleep delay of 10 seconds it will never work (with a timeout of 5 seconds) but it does use the timeout this time. Same code as above but with a 5 second timeout:

t0 = time.time()
    response = requests_retry_session().get(
except Exception as x:
    print('It failed :(', x.__class__.__name__)
    print('It eventually worked', response.status_code)
    t1 = time.time()
    print('Took', t1 - t0, 'seconds')

And the output of this is:

It failed :( ConnectionError
Took 21.829053163528442 seconds

That makes sense. Same backoff algorithm as before but now with 5 seconds for each attempt:

21.8 = 5 + 0 + 5 + 0.6 + 5 + 1.2 + 5

Works For 500ish Errors Too

This time, let's run into a 500 error:

t0 = time.time()
    response = requests_retry_session().get(
except Exception as x:
    print('It failed :(', x.__class__.__name__)
    print('It eventually worked', response.status_code)
    t1 = time.time()
    print('Took', t1 - t0, 'seconds')

The output becomes:

It failed :( RetryError
Took 2.353440046310425 seconds

Here, the reason the total time is 2.35 seconds and not the expected 1.8 is because there's a delay between my laptop and I tested with a local Flask server to do the same thing and then it took a total of 1.8 seconds.


Yes, this suggested implementation is very opinionated. But when you've understood how it works, understood your choices and have the documentation at hand you can easily implement your own solution.

Personally, I'm trying to replace all my requests.get(...) with requests_retry_session().get(...) and when I'm making this change I make sure I set a timeout on the .get() too.

The choice to consider a 500, 502 and 504 errors "retry'able" is actually very arbitrary. It totally depends on what kind of service you're reaching for. Some services only return 500'ish errors if something really is broken and is likely to stay like that for a long time. But this day and age, with load balancers protecting a cluster of web heads, a lot of 500 errors are just temporary. Obivously, if you're trying to do something very specific like requests_retry_session().post(...) with very specific parameters you probably don't want to retry on 5xx errors.

A decent Elasticsearch search engine implementation

09 April 2017 0 comments   Django, Web development, Python

The title is a bit of an understatement because I think it's pretty good. It's not perfect and it's not guaranteed to scale, but it works pretty well. Especially on search term typos.

This, my own blog, now has a search engine built with Elasticsearch using the Python library elasticsearch-dsl. The algorithm (if you can call it that) is my own afternoon hack invention. Before I explain how it works try out a couple of searches:

Try a couple of searches:

(each search appends &debug-search for extended output)

Also, by default it uses Elasticsearch's match_phrase so when you search for a multi-word thing, it requires a match on each term. E.g. date format which finds Date formatting, date formats etc.

But if you search for something where the whole phrase can't match, it splits up the search an uses a match operator instead (minus any stop words).


This solution is very much focussed on typos. One thing I really dislike in non-Google search engines is when you make a search and nothing is found and it says "Did you mean ...?". Quite likely I did, but why do I have to click it? Can't it just be clicked for me?

Also, if there's ambiguity and possibly some results based on what you typed and multiple potential "Did you mean...?". Why not just blend them alltogether like Google does? Here is my attempt to solve that. Come with me...

Figuring Out ALL Search Terms

So if you type "Firefix" (not "Firefox", also scroll to the bottom to see the debug table) then maybe, that's an actual word that might be in the database. Then by using the Elasticsearch's Suggesters it figures out alternative spellings based on frequency distributions within the indexed content. This lookup is actually really fast. So now it figures out three alternative ways to spell this term:

And, very arbitrarily I pick a score for the default term that the user typed in. Let's pick 1.1. Doesn't matter gravely and it's up for future tuning. The initial goal is to not bury this spelling alternative too far back.

Here's how to run the suggester for every defined doc type and generate a list of other search terms tuples (minimum score >=0.6).

search_terms = [(1.1, q)]
_search_terms = set([q])
doc_type_keys = (
    (BlogItemDoc, ('title', 'text')),
    (BlogCommentDoc, ('comment',)),
for doc_type, keys in doc_type_keys:
    suggester =
    for key in keys:
        suggester = suggester.suggest('sugg', q, term={'field': key})
    suggestions = suggester.execute_suggest()
    for each in suggestions.sugg:
        if each.options:
            for option in each.options:
                if option.score >= 0.6:
                    better = q.replace(each['text'], option['text'])
                    if better not in _search_terms:

Eventually we get a list (once sorted) that looks like this:

search_terms = [(1.1 'firefix'), (0.9, 'firefox'), (0.7, 'firefli'), (0.7, 'firfox')]

The only reason the code sorts this by the score is in case there are crazy-many search terms. Then we might want to chop off some and only use the 5 highest scoring spelling alternatives.

Building The Boosted OR-query

In this scenario, we're searching amongst blog posts. The title is likely to be a better match than the body. If the title mentions it we probably want to favor that over those where it's only mentioned in the body.

So to build up the OR-query we'll boost the title more than the body ("text" in this example) and we'll build it up using all possible search terms and boost them based on their score. Here's the complete query.

strategy = 'match_phrase'
if original_q:
    strategy = 'match'
search_term_boosts = {}
for i, (score, word) in enumerate(search_terms):
    # meaning the first search_term should be boosted most
    j = len(search_terms) - i
    boost = 1 * j * score
    boost_title = 2 * boost
    search_term_boosts[word] = (boost_title, boost)
    match = Q(strategy, title={
        'query': word,
        'boost': boost_title,
    }) | Q(strategy, text={
        'query': word,
        'boost': boost,
    if matcher is None:
        matcher = match
        matcher |= match

search_query = search_query.query(matcher)

The core is that it does Q('match_phrase' title='firefix', boost=2X) | Q('match_phrase', text='firefix', boost=X).

Here's another arbitrary number. The number 2. It means that the "title" is 2 times more important than the "text".

And that's it! Now every match is scored based on how suggester's score and whether it be matched on the "title" or the "text" (or both). Elasticsearch takes care of everything else. The default is to sort by the _score as ultimately dictated by Lucene.

Match Phrase or Match

In this implementation it tries to match using a match phrase query which basically tries to find matches where every word in the query matches.

The cheap solution here is to basically keep whole search function as is, but if absolutely nothing is found with a match_phrase, and there were multiple words, then just recurse over one more time and do it with a match query instead.

This could probably be improved and do the match_phrase first with higher boost and do the match too but with a lower boost. All in one big query.

Want A Copy?

Note, this copy is quite a mess! It's a personal side-project which is an excuse for experimentation and goofing around.

The full search function is here.

Please don't judge me for the scrappiness of the code but please share your thoughts on this being a decent application of Elasticsearch for smallish datasets like a blog.