Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Django. Clear filter

cache_memoize - a pretty decent cache decorator for Django

11 September 2017 1 comment   Django, Web development, Python

https://gist.github.com/peterbe/fd6ffc23325df849b27c549e769ce570


This is something that's grown up organically when working on Mozilla Symbol Server. It has served me very well and perhaps it's worth extracting into its own lib.

Usage

Basically, you are probably used to this in Django:

from django.core.cache import cache

def compute_something(user, special=False):
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, special)
    value = cache.get(cache_key)
    if value is None:
        value = _call_the_meat(user.id, special)  # some really slow function
        cache.set(cache_key, value, 60 * 5)
    return value

Here's instead how you can do exactly the same with cache_memoize:

from wherever.decorators import cache_memoize

@cache_memoize(60 * 5)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Cache invalidation

If you ever need to do non-trivial caching you know it's important to be able to invalidate the cache. Usually, to be able to do that you need to involved in how the cache key was created.

Consider our two examples above, here's first the common thing to do:

def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    cache_key = 'meatycomputation:{}:special={}'.format(user.id, False)
    cache.delete(cache_key)
    # And when it was special=True
    cache_key = 'meatycomputation:{}:special={}'.format(user.id, True)
    cache.delete(cache_key)

This works but it involves repeating the code that generates the cache key. You could extract that into its own function of course.

Here's how you do it with the cache_memoize decorator:

def save_user(user):
    do_something_that_will_need_to_cache_invalidate(user)

    compute_something.invalidate(user, special=False)
    compute_something.invalidate(user, special=True)    

Other features

There are actually two ways to "invalidate" the cache. Calling the new myoriginalfunction.invalidate(...) function or passing a custom extra keyword argument called _refresh. For example: compute_something(user, _refresh=True).

You can pass callables that get called when the cache works in your favor or when it's a cache miss. For example:

def increment_hits(user, special=None):
    # use your imagination
    metrics.incr(user.email)

def cache_miss(user, special=None):
    print("cache miss on {}".format(user.email))

@cache_memoize(
    60 * 5,
    hit_callable=increment_hits,
    miss_callable=cache_miss,
)
def compute_something(user, special=False):
    return _call_the_meat(user.id, special)  # some really slow function

Sometimes you just want to use the memoizer to make sure something only gets called "once" (or once per time interval). In that case it might be smart to not flood your cache backend with the value of the function output if there is one. For example:

@cache_memoize(60 * 60, store_result=False)  # idempotent guard
def calculate_and_update(user):
    # do something expensive here that is best to only do once per hour

Internally cache_memoize will basically try to convert every argument and keyword argument to a string with, kinda, str(). That might not always be appropriate because you might know that you have two distinct objects whose __str__ will yield the same result. For that you can use the args_rewrite parameter. For example:

def simplify_special_objects(obj):
    # use your imagination
    return obj.hostname 

@cache_memoize(60 * 5, args_rewrite=simplify_special_objects)
def compute_something(special_obj):
    return _call_the_meat(special_obj.hostname)

In conclusion

I've uploaded the code as a gist.

It's quite possible that there's already a perfectly good lib that does exactly this. If so, thanks for letting me know. If not, perhaps I ought to wrap this up and publish it on PyPI. Again, that's for letting me know.

Mozilla Symbol Server (aka. Tecken) load testing

06 September 2017 0 comments   Mozilla, Django, Web development, Python


(Thanks Miles Crabil not only for being an awesome Ops person but also for reviewing this blog post!)

My project over the summer, here at Mozilla, has been a project called Mozilla Symbol Server. It's a web service that uploads C++ symbol files, downloads C++ symbol files and symbolicates C++ crash stacktraces. It went into production last week which was fun but there's still lots of work to do on adding beyond-parity features and more optimizations.

What Is Mozilla Symbol Server?

The code name for this project is Tecken and it's written in Python (Django, Gunicorn) and uses PostgreSQL, Redis and Celery. The frontend is entirely static and developed (almost) as a separate project within. The frontend is written in React (using create-react-app and react-router). Everything is run as Docker containers. And if you ask me more details about how it's configured/deployed I'm afraid I have to defer to the awesome Mozilla CloudOps team.

One the challenges I faces developing Tecken is that symbol downloads need to be fast to handle high volumes of traffic. Today I did some load testing on our stage deployment and managed to start 14 concurrent clients that bombarded our staging server with realistic HTTPS GET queries based on log files. It's actually 7 + 1 + 4 + 2 concurrent clients. 7 of them from a m3.2xlarge EC2 node (8 vCPUs), 1 from a m3.large EC2 node (1 vCPU), 2 from two separate NYC based DigitalOcean personal servers and 2 clients here from my laptop on my home broadband. Basically, each loadtest script process got its own CPU.

Total req/s
It's hard to know how much more each client could push if it wasn't slowed down. Either way, the server managed to sustain about 330 requests per second. Our production baseline goal is to able to handle at least 40 requests per second.

After running for a while the caches started getting warm but about 1-5% of requests do have to make a boto3 roundtrip to an S3 bucket located on the other side of America in Oregon. There is also a ~5% penalty in that some requests trigger a write to a central Redis ElastiCache server. That's cheaper than the boto3 S3 call but still hefty latency costs to pay.

The ELB in our staging environment spreads the load between 2 c4.large (2 vCPUs, 3.75GB RAM) EC2 web heads. Each running with preloaded Gunicorn workers between Nginx and Django. Each web head has its own local memcached server to share memory between each worker but only local to the web head.

Is this a lot?

How long is a rope? Hard to tell. Tecken's performance is certainly more than enough and by the sheer fact that it was only just production deployed last week tells me we can probably find a lot of low-hanging fruit optimizations on the deployment side over time.

One way of answering that is to compare it with our lightest endpoint. One that involves absolutely no external resources. It's just pure Python in the form of ELB → Nginx → Gunicorn → Django. If I run hey from the same server I did the load testing I get a topline of 1,300 requests per second.

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/__lbheartbeat__
Summary:
  Total:    7.6604 secs
  Slowest:  0.0610 secs
  Fastest:  0.0018 secs
  Average:  0.0075 secs
  Requests/sec: 1305.4199
...  

That basically means that all the extra "stuff" (memcache key prep, memcache key queries and possible other high latency network requests) it needs to do in the Django view takes up roughly 3x the time it takes the absolute minimal Django request-response rendering.

Also, if I use the same technique to bombard a single URL, but one that actually involves most code steps but is definitely able to not require any slow ElastiCache writes or boto3 S3 reads you I get 800 requests per second:

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/advapi32.pdb/5EFB9BF42CC64024AB64802E467394642/advapi32.sy
Summary:
  Total:    12.4160 secs
  Slowest:  0.0651 secs
  Fastest:  0.0024 secs
  Average:  0.0122 secs
  Requests/sec: 805.4150
  Total data:   300000 bytes
  Size/request: 30 bytes
...

Lesson learned

Max CPU Used
It's a recurring reminder that performance is almost all about latency. If not RAM or disk it's networking. See the graph of the "Max CPU Used" which basically shows that CPU of user, system and stolen ("CPU spent waiting for the hypervisor to service another virtual CPU") never sum totalling over 50%.

Fastest *local* cache backend possible for Django

04 August 2017 7 comments   Django, Web development, Python


I did another couple of benchmarks of different cache backends in Django. This is an extension/update on Fastest cache backend possible for Django published a couple of months ago. This benchmarking isn't as elaborate as the last one. Fewer tests and fewer variables.

I have another app where I use a lot of caching. This web application will run its cache server on the same virtual machine. So no separation of cache server and web head(s). Just one Django server talking to localhost:11211 (memcached's default port) and localhost:6379 (Redis's default port).

Also in this benchmark, the keys were slightly smaller. To simulate my applications "realistic needs" I made the benchmark fall on roughly 80% cache hits and 20% cache misses. The cache keys were 1 to 3 characters long and the cache values lists of strings always 30 items long (e.g. len(['abc', 'def', 'cba', ... , 'cab']) == 30).

Also, in this benchmark I was too lazy to test all different parsers, serializers and compressors that django-redis supports. I only test python-memcached==1.58 versus django-redis==4.8.0 versus django-redis==4.8.0 && msgpack-python==0.4.8.

The results are quite "boring". There's basically not enough difference to matter.

Config Average Median Compared to fastest
memcache 4.51s 3.90s 100%
redis 5.41s 4.61s 84.7%
redis_msgpack 5.16s 4.40s 88.8%

UPDATE

As Hal pointed out in the comment, when you know the web server and the memcached server is on the same computer you should use UNIX sockets. They're "obviously" faster since the lack of HTTP overhead at the cost of it doesn't work over a network.

Because running memcached on a socket on OSX is a hassle I only have one benchmark. Note! This basically compares good old django.core.cache.backends.memcached.MemcachedCache with two different locations.

Config Average Median Compared to fastest
127.0.0.1:11211 3.33s 3.34s 81.3%
unix:/tmp/memcached.sock 2.66s 2.71s 100%

But there's more! Another option is to use pylibmc which is a Python client written in C. By the way, my Python I use for these microbenchmarks is Python 3.5.

Unfortunately I'm too lazy/too busy to do a matrix comparison of pylibmc on TCP versus UNIX socket. Here are the comparison results of using python-memcached versus pylibmc:

Client Average Median Compared to fastest
python-memcached 3.52s 3.52s 62.9%
pylibmc 2.31s 2.22s 100%

UPDATE 2

https://plot.ly/~jensens/36.embed

Seems my luck someone else has done the matrix comparison of python-memcached vs pylibmc on TCP vs UNIX socket:

https://plot.ly/~jensens/36.embed

Find static files defined in django-pipeline but not found

25 July 2017 0 comments   Django, Python


If you're reading this you're probably familiar with how, in django-pipeline, you define bundles of static files to be combined and served. If you're not familiar with django-pipeline it's unlike this'll be of much help.

The Challenge (aka. the pitfall)

So you specify bundles by creating things in your settings.py something like this:

PIPELINE = {
    'STYLESHEETS': {
        'colors': {
            'source_filenames': (
              'css/core.css',
              'css/colors/*.css',
              'css/layers.css'
            ),
            'output_filename': 'css/colors.css',
            'extra_context': {
                'media': 'screen,projection',
            },
        },
    },
    'JAVASCRIPT': {
        'stats': {
            'source_filenames': (
              'js/jquery.js',
              'js/d3.js',
              'js/collections/*.js',
              'js/aplication.js',
            ),
            'output_filename': 'js/stats.js',
        }
    }
}

You do a bit more configuration and now, when you run ./manage.py collectstatic --noinput Django and django-pipeline will gather up all static files from all Django apps installed, then start post processing then and doing things like concatenating them into one file and doing stuff like minification etc.

The problem is, if you look at the example snippet above, there's a typo. Instead of js/application.js it's accidentally js/aplication.js. Oh noes!!

What's sad is it that nobody will notice (running ./manage.py collectstatic will exit with a 0). At least not unless you do some careful manual reviewing. Perhaps you will notice later, when you've pushed the site to prod, that the output file js/stats.js actually doesn't contain the code from js/application.js.

Or, you can automate it!

A Solution (aka. the hack)

I started this work this morning because the error actually happened to us. Thankfully not in production but our staging server produced a rendered HTML page with <link href="/static/css/report.min.cd784b4a5e2d.css" rel="stylesheet" type="text/css" /> which was an actual file but it was 0 bytes.

It wasn't that hard to figure out what the problem was because of the context of recent changes but it would have been nice to catch this during continuous integration.

So what we did was add an extra class to settings.STATICFILES_FINDERS called myproject.base.finders.LeftoverPipelineFinder. So now it looks like this:

# in settings.py

STATICFILES_FINDERS = (
    'django.contrib.staticfiles.finders.FileSystemFinder',
    'django.contrib.staticfiles.finders.AppDirectoriesFinder',
    'pipeline.finders.PipelineFinder',
    'myproject.finders.LeftoverPipelineFinder',  # the new hotness!
)

And here's the class implementation:

from pipeline.finders import PipelineFinder

from django.conf import settings
from django.core.exceptions import ImproperlyConfigured


class LeftoverPipelineFinder(PipelineFinder):
    """This finder is expected to come AFTER 
    django.contrib.staticfiles.finders.FileSystemFinder and 
    django.contrib.staticfiles.finders.AppDirectoriesFinder in 
    settings.STATICFILES_FINDERS.
    If a path is looked for here it means it's trying to find a file
    that none of the regular staticfiles finders couldn't find.
    """
    def find(self, path, all=False):
        # Before we raise an error, try to find out where,
        # in the bundles, this was defined. This will make it easier to correct
        # the mistake.
        for config_name in 'STYLESHEETS', 'JAVASCRIPT':
            config = settings.PIPELINE[config_name]
            for key in config:
                if path in config[key]['source_filenames']:
                    raise ImproperlyConfigured(
                        'Static file {!r} can not be found anywhere. Defined in '
                        "PIPELINE[{!r}][{!r}]['source_filenames']".format(
                            path,
                            config_name,
                            key,
                        )
                    )
        # If the file can't be found AND it's not in bundles, there's
        # got to be something else really wrong.
        raise NotImplementedError(path)

Now, if you have a typo or something in your bundles, you'll get a nice error about it as soon as you try to run collectstatic. For example:

▶ ./manage.py collectstatic --noinput
Post-processed 'css/search.min.css' as 'css/search.min.css'
Post-processed 'css/base.min.css' as 'css/base.min.css'
Post-processed 'css/base-dynamic.min.css' as 'css/base-dynamic.min.css'
Post-processed 'js/google-analytics.min.js' as 'js/google-analytics.min.js'
Traceback (most recent call last):
...
django.core.exceptions.ImproperlyConfigured: Static file 'js/aplication.js' can not be found anywhere. Defined in PIPELINE['JAVASCRIPT']['stats']['source_filenames']

Final Thoughts

This was a morning hack. I'm still not entirely sure if this the best approach, but there was none better and the result is pretty good.

We run ./manage.py collectstatic --noinput in our continous integration just before it runs ./manage.py test. So if you make a Pull Request that has a typo in bundles.py it will get caught.

Unfortunately, it won't find missing files if you use foo*.js or something like that. django-pipeline uses glob.glob to convert expressions like that into a list of actual files and that depends on the filesystem and all of that happens before the django.contrib.staticfiles.finders.find function is called.

If you have any better suggestions to solve this, please let me know.

Fastest Redis configuration for Django

11 May 2017 0 comments   Django, Web development, Linux, Python

https://github.com/peterbe/django-fastest-redis


I have an app that does a lot of Redis queries. It all runs in AWS with ElastiCache Redis. Due to the nature of the app, it stores really large hash tables in Redis. The application then depends on querying Redis for these. The question is; What is the best configuration possible for the fastest service possible?

Note! Last month I wrote Fastest cache backend possible for Django which looked at comparing Redis against Memcache. Might be an interesting read too if you're not sold on Redis.

Options

All options are variations on the compressor, serializer and parser which are things you can override in django-redis. All have an effect on the performance. Even compression, for if the number of bytes between Redis and the application is smaller, then it should have better network throughput.

Without further ado, here are the variations:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/0',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        }
    },
    "json": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/1',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.json.JSONSerializer",
        }
    },
    "ujson": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/2',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "fastestcache.ujson_serializer.UJSONSerializer",
        }
    },
    "msgpack": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/3',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        }
    },
    "hires": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/4',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "PARSER_CLASS": "redis.connection.HiredisParser",
        }
    },
    "zlib": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/5',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
        }
    },
    "lzma": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config('REDIS_LOCATION', 'redis://127.0.0.1:6379') + '/6',
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "COMPRESSOR": "django_redis.compressors.lzma.LzmaCompressor"
        }
    },
}

As you can see, they each have a variation on the OPTIONS.PARSER_CLASS, OPTIONS.SERIALIZER or OPTIONS.COMPRESSOR.

The default configuration is to use redis-py and to pickle the Python objects to a bytestring. Pickling in Python is pretty fast but it has the disadvantage that it's Python specific so you can't have a Ruby application reading the same Redis database.

The Experiment

Note how I have one LOCATION per configuration. That's crucial for the sake of testing. That way one database is all JSON and another is all gzip etc.

What the benchmark does is that it measures how long it takes to READ a specific key (called benchmarking). Then, once it's done that it appends that time to the previous value (or [] if it was the first time). And lastly it writes that list back into the database. That way, towards the end you have 1 key whose value looks something like this: [0.013103008270263672, 0.003879070281982422, 0.009411096572875977, 0.0009970664978027344, 0.0002830028533935547, ..... MANY MORE ....].

Towards the end, each of these lists are pretty big. About 500 to 1,000 depending on the benchmark run.

In the experiment I used wrk to basically bombard the Django server on the URL /random (which makes a measurement with a random configuration). On the EC2 experiment node, it finalizes around 1,300 requests per second which is a decent number for an application that does a fair amount of writes.

The way I run the Django server is with uwsgi like this:

uwsgi --http :8000 --wsgi-file fastestcache/wsgi.py --master --processes 4 --threads 2

And the wrk command like this:

wrk -d30s  "http://127.0.0.1:8000/random"

(that, by default, runs 2 threads on 10 connections)

At the end of starting the benchmarking, I open http://localhost:8000/summary which spits out a table and some simple charts.

An Important Quirk

Time measurements over time
One thing I noticed when I started was that the final numbers' average was very different from the medians. That would indicate that there are spikes. The graph on the right shows the times put into that huge Python list for the default configuration for the first 200 measurements. Note that there are little spikes but generally quite flat over time once it gets past the beginning.

Sure enough, it turns out that in almost all configurations, the time it takes to make the query in the beginning is almost order of magnitude slower than the times once the benchmark has started running for a while.

So in the test code you'll see that it chops off the first 10 times. Perhaps it should be more than 10. After all, if you don't like the spikes you can simply look at the median as the best source of conclusive truth.

The Code

The benchmarking code is here. Please be aware that this is quite rough. I'm sure there are many things that can be improved, but I'm not sure I'm going to keep this around.

The Equipment

The ElastiCache Redis I used was a cache.m3.xlarge (13 GiB, High network performance) with 0 shards and 1 node and no multi-zone enabled.

The EC2 node was a m4.xlarge Ubuntu 16.04 64-bit (4 vCPUs and 16 GiB RAM with High network performance).

Both the Redis and the EC2 were run in us-west-1c (North Virginia).

The Results

Here are the results! Sorry if it looks terrible on mobile devices.

root@ip-172-31-2-61:~# wrk -d30s  "http://127.0.0.1:8000/random" && curl "http://127.0.0.1:8000/summary"
Running 30s test @ http://127.0.0.1:8000/random
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.19ms    6.32ms  60.14ms   80.12%
    Req/Sec   583.94    205.60     1.34k    76.50%
  34902 requests in 30.03s, 2.59MB read
Requests/sec:   1162.12
Transfer/sec:     88.23KB
                         TIMES        AVERAGE         MEDIAN         STDDEV
json                      2629        2.596ms        2.159ms        1.969ms
msgpack                   3889        1.531ms        0.830ms        1.855ms
lzma                      1799        2.001ms        1.261ms        2.067ms
default                   3849        1.529ms        0.894ms        1.716ms
zlib                      3211        1.622ms        0.898ms        1.881ms
ujson                     3715        1.668ms        0.979ms        1.894ms
hires                     3791        1.531ms        0.879ms        1.800ms

Best Averages (shorter better)
###############################################################################
██████████████████████████████████████████████████████████████   2.596  json
█████████████████████████████████████                            1.531  msgpack
████████████████████████████████████████████████                 2.001  lzma
█████████████████████████████████████                            1.529  default
███████████████████████████████████████                          1.622  zlib
████████████████████████████████████████                         1.668  ujson
█████████████████████████████████████                            1.531  hires
Best Medians (shorter better)
###############################################################################
███████████████████████████████████████████████████████████████  2.159  json
████████████████████████                                         0.830  msgpack
████████████████████████████████████                             1.261  lzma
██████████████████████████                                       0.894  default
██████████████████████████                                       0.898  zlib
████████████████████████████                                     0.979  ujson
█████████████████████████                                        0.879  hires


Size of Data Saved (shorter better)
###############################################################################
█████████████████████████████████████████████████████████████████  60K  json
██████████████████████████████████████                             35K  msgpack
████                                                                4K  lzma
█████████████████████████████████████                              35K  default
█████████                                                           9K  zlib
████████████████████████████████████████████████████               48K  ujson
█████████████████████████████████████                              34K  hires

Discussion Points

Conclusion

This experiment has lead me to the conclusion that the best serializer is msgpack and the best compression is zlib. That is the best configuration for django-redis.

msgpack has implementation libraries for many other programming languages. Right now that doesn't matter for my application but if msgpack is both faster and more versatile (because it supports multiple languages) I conclude that to be the best serializer instead.

A decent Elasticsearch search engine implementation

09 April 2017 0 comments   Django, Web development, Python


The title is a bit of an understatement because I think it's pretty good. It's not perfect and it's not guaranteed to scale, but it works pretty well. Especially on search term typos.

This, my own blog, now has a search engine built with Elasticsearch using the Python library elasticsearch-dsl. The algorithm (if you can call it that) is my own afternoon hack invention. Before I explain how it works try out a couple of searches:

Try a couple of searches:

(each search appends &debug-search for extended output)

Also, by default it uses Elasticsearch's match_phrase so when you search for a multi-word thing, it requires a match on each term. E.g. date format which finds Date formatting, date formats etc.

But if you search for something where the whole phrase can't match, it splits up the search an uses a match operator instead (minus any stop words).

Typo-focussed

This solution is very much focussed on typos. One thing I really dislike in non-Google search engines is when you make a search and nothing is found and it says "Did you mean ...?". Quite likely I did, but why do I have to click it? Can't it just be clicked for me?

Also, if there's ambiguity and possibly some results based on what you typed and multiple potential "Did you mean...?". Why not just blend them alltogether like Google does? Here is my attempt to solve that. Come with me...

Figuring Out ALL Search Terms

So if you type "Firefix" (not "Firefox", also scroll to the bottom to see the debug table) then maybe, that's an actual word that might be in the database. Then by using the Elasticsearch's Suggesters it figures out alternative spellings based on frequency distributions within the indexed content. This lookup is actually really fast. So now it figures out three alternative ways to spell this term:

And, very arbitrarily I pick a score for the default term that the user typed in. Let's pick 1.1. Doesn't matter gravely and it's up for future tuning. The initial goal is to not bury this spelling alternative too far back.

Here's how to run the suggester for every defined doc type and generate a list of other search terms tuples (minimum score >=0.6).

search_terms = [(1.1, q)]
_search_terms = set([q])
doc_type_keys = (
    (BlogItemDoc, ('title', 'text')),
    (BlogCommentDoc, ('comment',)),
)
for doc_type, keys in doc_type_keys:
    suggester = doc_type.search()
    for key in keys:
        suggester = suggester.suggest('sugg', q, term={'field': key})
    suggestions = suggester.execute_suggest()
    for each in suggestions.sugg:
        if each.options:
            for option in each.options:
                if option.score >= 0.6:
                    better = q.replace(each['text'], option['text'])
                    if better not in _search_terms:
                        search_terms.append((
                            option['score'],
                            better,
                        ))
                        _search_terms.add(better)

Eventually we get a list (once sorted) that looks like this:

search_terms = [(1.1 'firefix'), (0.9, 'firefox'), (0.7, 'firefli'), (0.7, 'firfox')]

The only reason the code sorts this by the score is in case there are crazy-many search terms. Then we might want to chop off some and only use the 5 highest scoring spelling alternatives.

Building The Boosted OR-query

In this scenario, we're searching amongst blog posts. The title is likely to be a better match than the body. If the title mentions it we probably want to favor that over those where it's only mentioned in the body.

So to build up the OR-query we'll boost the title more than the body ("text" in this example) and we'll build it up using all possible search terms and boost them based on their score. Here's the complete query.

strategy = 'match_phrase'
if original_q:
    strategy = 'match'
search_term_boosts = {}
for i, (score, word) in enumerate(search_terms):
    # meaning the first search_term should be boosted most
    j = len(search_terms) - i
    boost = 1 * j * score
    boost_title = 2 * boost
    search_term_boosts[word] = (boost_title, boost)
    match = Q(strategy, title={
        'query': word,
        'boost': boost_title,
    }) | Q(strategy, text={
        'query': word,
        'boost': boost,
    })
    if matcher is None:
        matcher = match
    else:
        matcher |= match

search_query = search_query.query(matcher)

The core is that it does Q('match_phrase' title='firefix', boost=2X) | Q('match_phrase', text='firefix', boost=X).

Here's another arbitrary number. The number 2. It means that the "title" is 2 times more important than the "text".

And that's it! Now every match is scored based on how suggester's score and whether it be matched on the "title" or the "text" (or both). Elasticsearch takes care of everything else. The default is to sort by the _score as ultimately dictated by Lucene.

Match Phrase or Match

In this implementation it tries to match using a match phrase query which basically tries to find matches where every word in the query matches.

The cheap solution here is to basically keep whole search function as is, but if absolutely nothing is found with a match_phrase, and there were multiple words, then just recurse over one more time and do it with a match query instead.

This could probably be improved and do the match_phrase first with higher boost and do the match too but with a lower boost. All in one big query.

Want A Copy?

Note, this copy is quite a mess! It's a personal side-project which is an excuse for experimentation and goofing around.

The full search function is here.

Please don't judge me for the scrappiness of the code but please share your thoughts on this being a decent application of Elasticsearch for smallish datasets like a blog.

Podcasttime.io - How Much Time Do Your Podcasts Take To Listen To?

13 February 2017 3 comments   ReactJS, Javascript, Django, Web development, Python

https://podcasttime.io/about


tl;dr; It's a web app where you search and find the podcasts you listen to. It then gives you a break down how much time that requires to keep up, per day, per week and per month. Podcasttime.io

Podcasttime.io on Firefox iOS
First I wrote some scripts to scrape various sources of podcasts. This is basically a RSS feed URL from which you can fetch the name and an image. And with some cron jobs you can download and parse each podcast feed and build up an index of how many episodes they have and how long each episode is. Together with each episodes "publish date" you can easily figure out an average of how much content each podcast puts out over time.

Suppose you listen to JavaScript Air, Talk Python To Me and Google Cloud Platform Podcast for example, that means you need to listen to podcasts for about 8 minutes per day to keep up.

The Back End

The technology is exciting. The backend is a Django 1.10 server. It manages a PostgreSQL database of all the podcasts, episodes, cron jobs etc. Through Django ORM signals is packages up each podcast with its metadata and stores it in an Elasticsearch database. All the communication between Django and ElasticSearch is done with Elasticsearch DSL.

Also, all the downloading and parsing of feeds is done as background tasks in Celery. This got really interesting/challenging because sooo many podcasts are poorly marked up and many a times the only way to find out how long an episode is is to use ffmpeg to probe it and that takes time.

Another biggish challenge is that fact that often things simply don't work because of networks being what they are, unreliable. So you have to re-attempt network calls without accidentally getting caught in infinite loops of accidentally putting a bad/broken RSS feed back into the background queue again and again and again.

The Front End

Actually, the first prototype of this app was written with Django as the front end plus some jQuery to tie things together. On a plane ride, and as an excuse to learn it, I re-wrote the whole thing in React with Redux. To be honest, I never really enjoyed that and it felt like everything was hard and I had to do more jumping-around-files than actual coding. In particular, Redux is nice but when you have a lot of AJAX both inside components and upon mounting it gets quite messy in my humble opinion.

So, on another plane ride (to Hawaii, so I had more time) I re-wrote it from scratch but this time using three beautiful pieces of front end technology: create-react-app, Mobx and mobx-router. Suddenly it became fun again. Mobx (or Redux or something "fluxy") is necessary if you want fancy pushState URLs AND a central (aka global) state management.

To be perfectly honest, I never actually tried combining Mobx with something like react-router or if it's even possible. But with mobx-router it's quite neat. You write a "views route map" (see example) where you can kick off AJAX before entering (and leaving) routes. Then you use that to populate a global store and now all components can be almost entirely about simply rendering the store. There is some AJAX within the mounted components (e.g. the search and autocomplete).

Plotly graph
On the home page, there's a chart that rather unscientifically plots episode durations over time as a line chart. I'm trying a library called Plotly which is actually a online app for building charts but they offer a free JavaScript library too for generating graphs. Not entirely sure how I feel about it yet but apart from looking a big crowded on mobile, it's working really well.

A Killer Feature

This is a pattern I've wanted to build but never managed to get right. The way to get data about a podcast (and its episodes) is to do an Elasticsearch search. From the homepage you basically call /find?q=Planet%20money when you search. That gives you almost all the information you need. So you store that in the global store. Then, if the user clicks on that particular podcast to go to its "perma page" you can simply load that podcast's individual route and you don't need to do something like /find?id=727 because you already have everything you need. If the user then opens that page in a new tab or reloads you now have to fetch just the one podcast, so you simply call /find?id=727. In other words, subsequent page loads load instantly! (Basically, it updates the store's podcast object upon clicking any of the podcasts iterated over from the listing. Code here)

And to top that - and this is where a good router shines - if you make a search or something, click something and click back since you have a global store of state, you can simply reuse that without needing another AJAX query.

The State of the Future

First of all, this is a fun little side project and it's probably buggy. My goal is not to make money on it but to build up a graph. Every time someone uses the site and finds the podcasts they listen to that slowly builds up connections. If you listen to "The Economist", "Planet Money" and "Freakonomics", that tie those together loosely. It's hard to programmatically know that those three podcasts are "related" but they are by "peoples' taste".

The ultimate goal of this is; now I can recommend other podcasts based on a given set. It's a little bit like LastFM used to work. Using Audioscrobbler LastFM was able to build up a graph based on what people preferred to listen to and using that network of knowledge they can recommend things you have not listened to but probably would appreciate.

At the moment, there's a simple Picks listing of "lists" (aka "picks") that people have chosen. With enough time and traffic I'll try to use Elasticsearch's X-Pack Graph capabilities to develop a search engine based on this.

At the time of writing, I've indexed 4,669 podcasts, spanning 611,025 episodes which equates to 549,722 hours of podcast content.

The Code

The front end code is available on github.com/peterbe/podcasttime2 and is relatively neat and tidy. The most interesting piece is probably the views/index.js which is the "controller" of things. That's where it decides which component to render, does the AJAX queries and manages the global store.

The back end code is a bit messier. It's done as an "app" as part of this very blog. The way the Elasticsearch indexing is configured is here and the hotch potch code for scraping and parsing RSS feeds is here.

Please try it out and show me your selection. You can drop feedback here.

Using Fanout.io in Django

13 December 2016 1 comment   Python, Web development, Django, Mozilla, Javascript


Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put in some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. And to push messages to your user you basically send them to Fanout.io from the server and they "forward" it to the WebSocket.

The HTML page looks like this:

<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And in server it looks something like this:

from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
   text = request.POST['comment']
   saved_comment = Comment.objects.create(text=text, user=request.user)
   fanout.publish('mycomments', {'new_comment': saved_comment.id})
   return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel. So it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. For example, here's a slightly more fleshed out example of how you'd use the subscription.

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
    if (data.new_comment) {
      // server says a new comment has been posted in the server
      $.json('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(comment) {        
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render) but instead do an AJAX query to get all new comments since some parameter or something.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be to do a setInterval loop that frequently does an AJAX query to see if there's new content to update. This is cumbersome because it requires a lot heavier AJAX queries. You might want to make it secure so you engage sessions that need to be looked up each time. Or, since you're going to request it often you have to write a very optimized server-side endpoint that is cheap to query often.

And last but not least, if you rely on an AJAX loop interval, you have to pick a frequency that your server can cope with and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most important, you don't need to worry about running a WebSocket server. It's not terribly hard to do one yourself on your laptop with a bit of Node Express or Tornado but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago and one of the attractive things it offers is that it's got a freemium option which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to chose Fanout.io over its competitors but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) has freemium plans, supports authentication, REST APIs (for sending and for querying connected clients' stats).

There are also some more advanced feature packed solutions like Meteor, Firebase and GunDB that act more like databases that are connected via WebSockets or alike. For example, you can have a database as a "conduit" for pushing data to a client. Meaning, instead of sending the data from the server directly you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar whereby it sets up something similar as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:

window.Fanout = (function() {
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });        
    };
  }
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance a hash of the remote JavaScript you include. That means that if someone hacks the result of loading https://cdn.example.com/somelib.js the browser compares the hash of that with a hash mentioned in the <script> tag and refuses to load it if the hash doesn't match.

In the example of Fanout.io it actually looks like this:

<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

The SHA you get from the Fanout.io documentation. It requires, and implies, that you need to use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this the server basically relys on standard HTTP techniques but the responses are long lasting instead. It means the request simply takes a very long time to respond and when it does, that's when data can be passed.

This is not a problem for modern browsers. They almost all support WebSocket but you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is that it first creates a long-polling connection but then shortly after tries to "upgrade" to WebSockets if it's supported. However, the projects I work only need to support modern browsers and there's a trick to tell Fanout to go straight to WebSockets:

var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout *start* with fancy-pants WebSocket and
    // if that doesn't work it **falls back** on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby it does a setInterval loop that does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:

var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create Update Delete) page. So, for that I made the whole Fanout functionality "lazy" and it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:

window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this I'd try to spend a little time to investigate the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to help explain more what's great about Fanout.io

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

Optimization of QuerySet.get() with or without select_related

03 November 2016 1 comment   Python, Django, PostgreSQL


If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:

class Artist(models.Models):
    name = models.CharField(max_length=200)
    ...

class Song(models.Models):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:

>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:

>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:

def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgre.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

The Result

Function Median Average Std Dev
f1 3.19ms 9.17ms 19.61ms
f2 2.28ms 6.28ms 15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.

Django test optimization with no-op PIL engine

27 October 2016 6 comments   Python, Django


The Air Mozilla project is a regular Django webapp. It's reasonably big for a more or less one man project. It's ~200K lines of Python and ~100K lines of JavaScript. There are 816 "unit tests" at the time of writing. Most of them are kinda typical Django tests. Like:

def test_some_feature(self):
    thing = MyModel.objects.create(key='value')
    url = reverse('namespace:name', args=(thing.id,))
    response = self.client.get(url)
    ....

Also, the site uses sorl.thumbnail to automatically generate thumbnails from uploaded images. It's a great library.

However, when running tests, you almost never actually care about the image itself. Your eyes will never feast on them. All you care about is that there is an image, that it was resized and that nothing broke. You don't write tests that checks the new image dimensions of a generated thumbnail. If you need tests that go into that kind of detail, it best belongs somewhere else.

So, I thought, why not fake ALL operations that are happening inside sorl.thumbnail to do with resizing and cropping images.

Here's the changeset that does it. Note, that the trick is to override the default THUMBNAIL_ENGINE that sorl.thumbnail loads. It usually defaults to sorl.thumbnail.engines.pil_engine.Engine and I just wrote my own that does no-ops in almost every instance.

I admittedly threw it together quite quickly just to see if it was possible. Turns out, it was.

# Depends on setting something like:
#    THUMBNAIL_ENGINE = 'airmozilla.base.tests.testbase.FastSorlEngine'
# in your settings specifically for running tests.


from sorl.thumbnail.engines.base import EngineBase


class _Image(object):
    def __init__(self):
        self.size = (1000, 1000)
        self.mode = 'RGBA'
        self.data = '\xa0'


class FastSorlEngine(EngineBase):

    def get_image(self, source):
        return _Image()

    def get_image_size(self, image):
        return image.size

    def _colorspace(self, image, colorspace):
        return image

    def _scale(self, image, width, height):
        image.size = (width, height)
        return image

    def _crop(self, image, width, height, x_offset, y_offset):
        image.size = (width, height)
        return image

    def _get_raw_data(self, image, *args, **kwargs):
        return image.data

    def is_valid_image(self, raw_data):
        return bool(raw_data)

So, was it much faster?

It's hard to measure because the time it takes to run the whole test suite depends on other stuff going on on my laptop during the long time it takes to run the tests. So I ran them 8 times with the old code and 8 times with this new hack.

Iteration Before After
1 82.789s 73.519s
2 82.869s 67.009s
3 77.100s 60.008s
4 74.642s 58.995s
5 109.063s 80.333s
6 100.452s 81.736s
7 85.992s 61.119s
8 82.014s 73.557s
Average 86.865s 69.535s
Median 82.869s 73.519s
Std Dev 11.826s 9.0757s

So rougly 11% faster. Not a lot but it adds up when you're doing test-driven development or debugging where you run a suite or a test over and over as you're saving the files/tests you're working on.

Room for improvement

In my case, it just worked with this simple solution. Your site might do fancier things with the thumbnails. Perhaps we can combine forces on this and finalize a working solution into a standalone package.