Peterbe.com

A blog and website by Peter Bengtsson

Currently only showing blog entries under the category: Django.

django-pipeline and Zopfli

15 August 2018 0 comments   Python, Web development, Django


tl;dr; I wrote my own extension to django-pipeline that uses Zopfli to create .gz files from static assets collected in Django. Here's the code.

Nginx and Gzip

What I wanted was to continue to use django-pipeline, which does a great job of reading a settings.BUNDLES setting and generating things like /static/js/myapp.min.a206ec6bd8c7.js. It has configurable options to not just make those files but also generate /static/js/myapp.min.a206ec6bd8c7.js.gz, which means that with gzip_static in Nginx, Nginx doesn't have to Gzip compress static files on-the-fly but can basically just read them from disk. Nginx doesn't care how the file got there, but an immediate advantage of preparing the file on disk is that the compression can be higher (smaller .gz files). That means smaller responses sent to the client and less CPU work needed from Nginx. Your job is to set gzip_static on; in your Nginx config (per location) and make sure every compressible file exists on disk with the same name but with the .gz suffix.

In other words, when the client does GET https://example.com/static/foo.js, Nginx quickly does a read on the file system to see if there exists a ROOT/static/foo.js.gz and, if so, returns that. If the file doesn't exist, and you have gzip on; in your config, Nginx will read ROOT/static/foo.js into memory, compress it (usually with a lower compression level) and return that. Nginx takes care of figuring out whether to do this at all by reading the Accept-Encoding header from the request.

Zopfli

The best solution today to generate these .gz files is Zopfli. Zopfli is slower than good old regular gzip but the files get smaller. To manually compress a file you can install the zopfli executable (e.g. brew install zopfli or apt install zopfli) and then run zopfli $ROOT/static/foo.js which creates a $ROOT/static/foo.js.gz file.
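
To give an idea, here's a minimal sketch (not from the original post) of batch-compressing static files with the zopfli executable from Python; the directory and file suffixes are assumptions:

import pathlib
import subprocess

# Hedged sketch: compress every .js and .css file under a directory with the
# zopfli executable. Assumes `zopfli` is on $PATH; the path and suffixes are
# made up for illustration.
STATIC_ROOT = pathlib.Path("static")

for path in STATIC_ROOT.rglob("*"):
    if path.suffix not in (".js", ".css"):
        continue
    gz = path.parent / (path.name + ".gz")
    if not gz.exists():
        # zopfli writes `<file>.gz` next to the original file
        subprocess.run(["zopfli", str(path)], check=True)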

So your task is to build some pipelining code that generates a .gz version of every static file your Django server creates.
At first I tried django-static-compress which has an extension to regular Django staticfiles storage. The default staticfiles storage is django.contrib.staticfiles.storage.StaticFilesStorage and that's what django-static-compress extends.

But I wanted more. I wanted all the good bits from django-pipeline (minification, hashes in filenames, concatenation, etc.). Also, in django-static-compress you can't control the parameters to zopfli, such as the number of iterations. And with django-static-compress you have to install Brotli, which I can't use because I don't want to compile my own Nginx.

Solution

So I wrote my own little mashup. I took some ideas from how django-pipeline does regular gzip compression as a post-process step. And in my case, I never want to bother with any of the other files that are put into the settings.STATIC_ROOT directory from the collectstatic command.

Here's my implementation: peterbecom.storage.ZopfliPipelineCachedStorage. Check it out. It's very tailored to my personal preferences and use case but it works great. To use it, I have this in my settings.py: STATICFILES_STORAGE = "peterbecom.storage.ZopfliPipelineCachedStorage"
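
If you just want the gist of it, here's a minimal sketch of the idea (not the actual implementation linked above), assuming django-pipeline's PipelineCachedStorage and a zopfli executable on $PATH:

import subprocess

from pipeline.storage import PipelineCachedStorage


class ZopfliStorageSketch(PipelineCachedStorage):
    # Which processed files to bother compressing (assumption)
    compressible_suffixes = (".js", ".css", ".svg", ".txt")

    def post_process(self, *args, **kwargs):
        # Let django-pipeline do its normal work first
        # (concatenation, minification, hashed filenames).
        for name, hashed_name, processed in super().post_process(*args, **kwargs):
            if processed is True and hashed_name.endswith(self.compressible_suffixes):
                # Writes `<hashed_name>.gz` next to the processed file
                subprocess.run(["zopfli", self.path(hashed_name)], check=True)
            yield name, hashed_name, processed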

I know what you're thinking

Why not try to get this into django-pipeline or into django-static-compress? The answer is frankly laziness. Hopefully someone else can pick up this task. I have fewer and fewer projects where I use Django to handle static files. These days most of my projects are single-page apps that are 100% static and use Django for XHR requests to get the data.

Django lock decorator with django-redis

14 August 2018 1 comment   Python, Web development, Django, Redis


Here's the code. It's quick-n-dirty but it works wonderfully:

import functools
import hashlib

from django.core.cache import cache
from django.utils.encoding import force_bytes


def lock_decorator(key_maker=None):
    """
    When you want to lock a function from more than 1 call at a time.
    """

    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if key_maker:
                key = key_maker(*args, **kwargs)
            else:
                key = str(args) + str(kwargs)
            lock_key = hashlib.md5(force_bytes(key)).hexdigest()
            with cache.lock(lock_key):
                return func(*args, **kwargs)

        return inner

    return decorator

How To Use It

This has saved my bacon more than once. I use it on functions that really need to be made synchronous. For example, suppose you have a function like this:

def fetch_remote_thing(name):
    try:
        return Thing.objects.get(name=name).result
    except Thing.DoesNotExist:
        # Need to go out and fetch this
        result = some_internet_fetching(name)  # Assume this is sloooow
        Thing.objects.create(name=name, result=result)
        return result

That function is quite dangerous because if it's executed by two concurrent web requests, for example, they will trigger two "identical" calls to some_internet_fetching, and if the database didn't have the name already, it will most likely trigger two calls to Thing.objects.create(name=name, ...), which could lead to integrity errors. Or, if it doesn't, the whole function breaks down because it assumes that there is only 1 or 0 of these Thing records.

Easy to solve: just add the lock_decorator:

@lock_decorator()
def fetch_remote_thing(name):
    try:
        return Thing.objects.get(name=name).result
    except Thing.DoesNotExist:
        # Need to go out and fetch this
        result = some_internet_fetching(name)  # Assume this is sloooow
        Thing.objects.create(name=name, result=result)
        return result

Now, thanks to Redis distributed locks, one call to the function is always allowed to finish before another one starts. All the hairy locking (in particular, the waiting) is implemented deep down in Redis, which is rock solid.
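
Put differently, the decorator boils down to something like this (note that cache.lock is a django-redis extension, not part of Django's core cache API):

from django.core.cache import cache

def fetch_remote_thing_synchronized(name):
    # Concurrent callers with the same name block here until the
    # first caller has finished.
    with cache.lock("lock:fetch_remote_thing:" + name):
        return fetch_remote_thing(name)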

Bonus Usage

Another use that has also saved my bacon is functions that aren't necessarily called with the same input arguments, but where each call is so resource intensive that you want to make sure only one of them runs at a time. Suppose you have a Django view function that does some resource-intensive work and you want to stagger the calls so that only one runs at a time. Like this, for example:

def api_stats_calculations(request, part):
    if part == 'users-per-month':
        data = _calculate_users_per_month()  # expensive
    elif part == 'pageviews-per-week':
        data = _calculate_pageviews_per_week()  # intensive
    elif part == 'downloads-per-day':
        data = _calculate_download_per_day()  # slow
    elif you == 'get' and the == 'idea':
        ...

    return http.JsonResponse({'data': data})

If you just put @lock_decorator() on this Django view function, and you have some (almost) concurrent calls to this function, for example from a uWSGI server running with threads and multiple processes, then it will not synchronize the calls. That's because the default lock key is built from the arguments, so two calls with different arguments (different part values, different request objects) get different lock keys and happily run concurrently.

The solution to this is to write your own function for generating the lock key, like this for example:

@lock_decorator(
    key_maker=lambda request, part: 'api_stats_calculations'
)
def api_stats_calculations(request, part):
    if part == 'users-per-month':
        data = _calculate_users_per_month()  # expensive
    elif part == 'pageviews-per-week':
        data = _calculate_pageviews_per_week()  # intensive
    elif part == 'downloads-per-day':
        data = _calculate_download_per_day()  # slow
    elif you == 'get' and the == 'idea':
        ...

    return http.JsonResponse({'data': data})

Now it works.

How Time-Expensive Is It?

Perhaps you worry that 99% of your calls to the function don't have the problem of being called concurrently. How much is the overhead of this lock costing you? I wondered that too and set up a simple stress test with a really simple Django view function. It looked something like this:

@lock_decorator(key_maker=lambda request: 'samekey')
def sample_view_function(request):
    return http.HttpResponse('Ok\n')

I started a Django server with uWSGI with multiple processes and threads enabled. Then I bombarded this function with a simple concurrent stress test and observed the requests per minute. The cost was extremely tiny and almost negligible (compared to not using the lock decorator). Granted, in this test I used Redis on redis://localhost:6379/0, but generally the conclusion was that the call is extremely fast and not something to worry too much about. Your mileage may vary, so do your own experiments for your context.
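
The stress test itself was nothing fancier than firing a bunch of concurrent GET requests and comparing throughput with and without the decorator. A hypothetical stand-in (the URL and numbers are made up):

import concurrent.futures
import time

import requests

URL = "http://localhost:8000/sample-view/"  # made-up URL

def hit(_):
    return requests.get(URL).status_code

t0 = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    statuses = list(executor.map(hit, range(200)))
elapsed = time.time() - t0
print(f"{len(statuses)} requests in {elapsed:.2f}s ({len(statuses) / elapsed:.1f} req/s)")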

What's Needed

You need to use django-redis as your Django cache backend. I've blogged before about using django-redis, for example Fastest cache backend possible for Django and Fastest Redis configuration for Django.
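
For completeness, the relevant cache settings look something like this (the Redis URL is just an example; use whatever your environment needs):

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}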

django-html-validator now supports Django 2.x

13 August 2018 0 comments   Python, Web development, Django

https://pypi.org/project/django-html-validator/


django-html-validator is a Django project that can validate your generated HTML. It does so by sending the HTML to https://html5.validator.nu/, or you can start your own Java server locally with vnu.jar from here. Validation errors can either be printed to stdout or written as .txt files in a temporary directory. You can also include it in your test suite and make it so that tests fail if invalid HTML is generated during rendering in Django unit tests.

The project seems to have become a lot more popular than I thought it would. It started as a one-evening hack and, because there was interest, I wrapped it up in a proper project with "docs" and set up CI for future contributions.

I had kind of forgotten about the project since almost all my current projects generate JSON on the server and build the DOM on-the-fly with client-side JavaScript, but apparently a lot of issues and PRs were filed related to making it work in Django 2.x. So I took the time last night to tidy up the tox.ini etc. and make the necessary compatibility fixes so it works with everything from Django 1.8 up to Django 2.1. Pull request here.

Thank you all who contributed! I'll try to do a better job of noticing filed issues in the future.

A good Django view function cache decorator for http.JsonResponse

20 June 2018 0 comments   Python, Web development, Django


I use this a lot. It has served me very well. The code:

import hashlib
import functools

import markus  # optional
from django.core.cache import cache
from django import http
from django.utils.encoding import force_bytes, iri_to_uri

metrics = markus.get_metrics(__name__)  # optional


def json_response_cache_page_decorator(seconds):
    """Cache only when there's a healthy http.JsonResponse response."""

    def decorator(func):

        @functools.wraps(func)
        def inner(request, *args, **kwargs):
            cache_key = 'json_response_cache:{}:{}'.format(
                func.__name__,
                hashlib.md5(force_bytes(iri_to_uri(
                    request.build_absolute_uri()
                ))).hexdigest()
            )
            content = cache.get(cache_key)
            if content is not None:

                # metrics is optional
                metrics.incr(
                    'json_response_cache_hit',
                    tags=['view:{}'.format(func.__name__)]
                )

                return http.HttpResponse(
                    content,
                    content_type='application/json'
                )
            response = func(request, *args, **kwargs)
            if (
                isinstance(response, http.JsonResponse) and
                response.status_code in (200, 304)
            ):
                cache.set(cache_key, response.content, seconds)
            return response

        return inner

    return decorator

To use it, simply add it to Django view functions that might return an http.JsonResponse. For example, something like this:

@json_response_cache_page_decorator(60)
def search(request):
    q = request.GET.get('q')
    if not q:
        return http.HttpResponseBadRequest('no q')
    results = search_database(q)
    return http.JsonResponse({
        'results': results,
    })

There are a couple of reasons I use this instead of django.views.decorators.cache.cache_page().

Disclaimer: This snippet of code comes from a side-project that has a very specific set of requirements. They're rather unique to that project and I have a full picture of the needs. E.g. I know which specific headers matter and which don't. Your project might be different. For example, perhaps you don't have markus to handle your metrics. Or perhaps you need to rewrite the query string to normalize the cache key differently. Point being, take the snippet of code as inspiration when you too find that django.views.decorators.cache.cache_page() isn't good enough for your Django view functions.
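
As an example of that last point, here's a hedged sketch of normalizing the cache key by stripping query string parameters that shouldn't matter (the ignored parameter names are made up):

from urllib.parse import parse_qsl, urlencode, urlsplit

IGNORE_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}

def normalized_uri(request):
    bits = urlsplit(request.build_absolute_uri())
    query = [(k, v) for k, v in parse_qsl(bits.query) if k not in IGNORE_PARAMS]
    return bits._replace(query=urlencode(query)).geturl()

# ...and then hash normalized_uri(request) instead of
# request.build_absolute_uri() when building cache_key.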

Always return namespaces in Django REST Framework

11 May 2018 1 comment   Python, Django


By default, when you hook up a model to Django REST Framework and run a query in JSON format, what you get is a list. E.g.

For GET localhost:8000/api/mymodel/

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

This isn't great because there's no good way to include other auxiliary data points that are relevant to this query. In Elasticsearch you get something like this:

{
  "took": 106,
  "timed_out": false,
  "_shards": {},
  "hits": {
    "total": 0,
    "hits": [],
    "max_score": 1
  }
}

Another key point is that perhaps today you can't think of any immediate reason to include some additional metadata about the query, but perhaps some day you will.

The way to solve this in Django REST Framework is to override the list function in your Viewset classes.

Before

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

After

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!
        return response

Now, to re-wrap that: response.data is an OrderedDict which you can change. Here's one way to do it:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        response.data = {
            'hits': response.data,
        }
        return response
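
Once the result is namespaced like this, adding auxiliary data later is trivial. For example (the 'took' key is purely illustrative, mimicking Elasticsearch):

import time  # at the top of views.py

def list(self, request, *args, **kwargs):
    t0 = time.time()
    response = super().list(request, *args, **kwargs)
    response.data = {
        'hits': response.data,
        # Rough server-side timing in milliseconds, a la Elasticsearch
        'took': int((time.time() - t0) * 1000),
    }
    return response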

And if you want to do the same for the "detail API", where you retrieve a single model instance, you can add an override to the retrieve method:

def retrieve(self, request, *args, **kwargs):
    response = super().retrieve(request, *args, **kwargs)
    response.data = {
        'hit': response.data,
    }
    return response

That's it. Perhaps it's personal preference but if you, like me, prefer this style, this is how you do it. I like namespacing things instead of dealing with raw lists.

"Namespaces are one honking great idea -- let's do more of those!"

From import this

Note! This works equally well when you enable pagination. Enabling pagination immediately changes the main result from a list to a dictionary. I.e. instead of...

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

you now get...

{
  "count": 3,
  "next": null,
  "previous": null,
  "items": [
    {"id": 1, "name": "Foo"},
    {"id": 2, "name": "Bar"},
    {"id": 3, "name": "Baz"}
  ]
}

So if you apply the "trick" mentioned in this blog post you end up with...:

{
  "hits": {
    "count": 3,
    "next": null,
    "previous": null,
    "items": [
      {"id": 1, "name": "Foo"},
      {"id": 2, "name": "Bar"},
      {"id": 3, "name": "Baz"}
    ]
  }
}

Efficient many-to-many field lookup in Django REST Framework

11 April 2018 0 comments   Python, Django, PostgreSQL


The basic setup

Suppose you have these models:

from django.db import models


class Category(models.Model):
    name = models.CharField(max_length=100)


class Blogpost(models.Model):
    title = models.CharField(max_length=100)
    categories = models.ManyToManyField(Category)

Suppose you hook these up to Django REST Framework and list all Blogpost items. Something like this:

# urls.py
from rest_framework import routers
from . import views


router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

What's the problem?

Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/), what will happen on the database is something like this:

SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";

SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...

Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogposts, it needs to make a query to the many-to-many table (app_blogpost_categories in this example) for each and every one.
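
For example, a serializer of roughly this shape (hypothetical, since the actual BlogpostSerializer isn't shown here) will trigger one categories query per Blogpost:

# serializers.py -- hypothetical example
from rest_framework import serializers

from .models import Blogpost


class BlogpostSerializer(serializers.ModelSerializer):
    # Resolving this field costs one query per Blogpost instance
    categories = serializers.SlugRelatedField(
        many=True, read_only=True, slug_field='name'
    )

    class Meta:
        model = Blogpost
        fields = ('id', 'title', 'categories')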

That's not going to be performant. In fact, it might be dangerous on your database if the query of blogposts gets big, like requesting 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap-ish, but doing 1,000 selects with a JOIN is never going to be cheap. It adds up horribly.

How you solve it

The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost table and 1 query on the app_category table.

First you have to override the ViewSet.list method. Then, in there you can do exactly what you need.

Here's the framework for this change:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here
        # like filtering.
        return Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!

        return response

Next, we need to make a mapping of Category.id -1-> Category.name. But we want to make sure we only do it for the categories that are involved in the Blogpost records that matter. You could do something like this:

category_names = {}
for category in Category.objects.all():
    category_names[category.id] = category.name

But to avoid doing a lookup of category names for those you never need, use the query set on Blogpost. I.e.

qs = self.get_queryset()
all_categories = Category.objects.filter(
    id__in=Blogpost.categories.through.objects.filter(
        blogpost__in=qs
    ).values('category_id')
)
category_names = {}
for category in all_categories:
    category_names[category.id] = category.name

Now you have a dictionary of all the Category IDs that matter.

Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant where you look them all up simply, without any fancy joins.

Next is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings) using collections.defaultdict, like this:

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
    categories_map[m2m.blogpost_id].append(
        category_names[m2m.category_id]
    )

So what we have now is a dictionary whose keys are the IDs in self.get_queryset() and each value is a list of strings, e.g. ['Category X', 'Category Z'] etc.

Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:

for each in response.data:
    each['categories'] = categories_map.get(each['id'], [])

The whole solution looks something like this:

# views.py
from collections import defaultdict

from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here
        # like filtering.
        return Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        qs = self.get_queryset()
        all_categories = Category.objects.filter(
            id__in=Blogpost.categories.through.objects.filter(
                blogpost__in=qs
            ).values('category_id')
        )
        category_names = {}
        for category in all_categories:
            category_names[category.id] = category.name

        categories_map = defaultdict(list)
        for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
            categories_map[m2m.blogpost_id].append(
                category_names[m2m.category_id]
            )

        for each in response.data:
            each['categories'] = categories_map.get(each['id'], [])

        return response

It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c) is better than O(n).

Discussion

Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.

Or, if you use pagination very conservatively, and only allow like 10 items per page, then it won't be so expensive to do one query per many-to-many field.

csso and django-pipeline

28 February 2018 0 comments   Python, Django, Javascript


This is a quick-and-dirty how-to on how to use csso to handle the minification/compression of CSS in django-pipeline.

First create a file called compressors.py somewhere in your project. Make it something like this:

import subprocess
from pipeline.compressors import CompressorBase
from django.conf import settings


class CSSOCompressor(CompressorBase):

    def compress_css(self, css):
        proc = subprocess.Popen(
            [
                settings.PIPELINE['CSSO_BINARY'], 
                '--restructure-off'
            ],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        css_out = proc.communicate(
            input=css.encode('utf-8')
        )[0].decode('utf-8')
        # was_size = len(css)
        # new_size = len(css_out)
        # print('FROM {} to {} Saved {}  ({!r})'.format(
        #     was_size,
        #     new_size,
        #     was_size - new_size,
        #     css_out[:50]
        # ))
        return css_out

In your settings.py where you configure django-pipeline make it something like this:

PIPELINE = {
    'STYLESHEETS': PIPELINE_CSS,
    'JAVASCRIPT': PIPELINE_JS,

    # These two important lines. 
    'CSSO_BINARY': path('node_modules/.bin/csso'),
    # Adjust the dotted path name to where you put your compressors.py
    'CSS_COMPRESSOR': 'peterbecom.compressors.CSSOCompressor',

    'JS_COMPRESSOR': ...,
}

Next, install csso-cli in your project root (where you have the package.json). It's a bit confusing. The main package is called csso, but to have a command line app you need to install csso-cli, and when that's been installed you'll have a command line app called csso.

$ yarn add csso-cli

or

$ npm i --save csso-cli

Check that it installed:

$ ./node_modules/.bin/csso --version
3.5.0

And that's it!

--restructure-off

So csso has an advanced feature to restructure the CSS and not just remove whitespace and unneeded semicolons. It costs a bit of time to do that, so if you want to squeeze the extra bytes out, leave restructuring on; if you'd rather save those milliseconds, disable it with --restructure-off. It's trading time for space.
See this benchmark for a comparison with and without --restructure-off in csso.

Why csso you might ask

Check out the latest results from css-minification-benchmark. It's not super easy to read, but it seems the best performing one in terms of space (bytes) is crass, written by my friend and former colleague @mattbasta. However, by far the fastest is csso when using --restructure-off. Minifying font-awesome.css with crass takes 326.52 ms versus 3.84 ms in csso.

But what's great about csso is Roman @lahmatiy Dvornov. I call him a friend too for all the help and work he's done on minimalcss (not a CSS minification tool, by the way). Roman really understands CSS, and csso is actively maintained by him and other smart people who actually get into the scary weeds of CSS browser hacks. That gives me more confidence to recommend csso. Also, squeezing a couple of extra bytes out of your .min.css files isn't important once gzip comes into play. It's better that the minification tool is solid and stable.

Check out Roman's slides which, even if you don't read them all, go to show that CSS minification is so much more than just regexing away whitespace.
Also crass admits as one of its disadvantages: "Certain "CSS hacks" that use invalid syntax are unsupported".

Conditional aggregation in Django 2.0

12 January 2018 4 comments   Python, Django, PostgreSQL


Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation", which is an SQL standard feature I didn't even know about.

Before

So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various different time intervals.

First of all, this is how it sets up the time intervals:

import datetime

from django.utils import timezone

today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)

And then, for each of these, there's a little function that returns a dict for each time interval:

from django.db.models import Sum

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
    }

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

What you get is exactly 2 x 4 = 8 queries. One COUNT and one SUM for each time interval. E.g.

SELECT SUM("upload_upload"."size") AS "size" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

SELECT COUNT(*) AS "__count" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

...6 more queries...

Middle

Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query.

# New, improved count_and_size() function!
from django.db.models import Count, Sum

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

Much better, now there's only one query per time bucket. So 4 queries in total. E.g.

SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

...3 more queries...

After

But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there's so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:

from django.db.models import Count, Q, Sum

def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)

q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),

    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),

    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),

    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)
numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}

Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:

SELECT 
  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", 
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size",

  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", 
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", 

  ...

FROM "upload_upload";

Is this the best thing to do? I'm starting to have my doubts.

Watch Out!

When I take this one monster query for a spin with an EXPLAIN ANALYZE prefix, I notice something worrying!

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)

A sequential scan! That's terrible. The created_at column is indexed with a BTREE, so why can't it use the index?

The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked.

So what did I do? I went back to the "middle" solution. One SELECT query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.

When Docker is too slow, use your host

11 January 2018 0 comments   Web development, Django, MacOSX, Docker


I have a side-project that is basically a React frontend, a Django API server and a Node universal React renderer. The killer feature is its Elasticsearch database that searches almost 2.5M large texts and 200K named objects. All the data is stored in PostgreSQL and there's some Python code that copies that stuff over to Elasticsearch for indexing.

Timings for searches in Songsearch
The PostgreSQL database is about 10GB and the Elasticsearch (version 6.1.0) indices are about 6GB. It's moderately big and even though individual searches take, on average ~75ms (in production) it's hefty. At least for a side-project.

On my MacBook Pro laptop, I use Docker to do development. Docker makes it really easy to run one command that starts memcached, Django, an AWS Product API Node app, create-react-app for the search and a separate create-react-app for the stats web app.

At first I tried to run PostgreSQL and Elasticsearch in Docker too, but after many attempts I had to just give up. It was too slow. Elasticsearch would keep crashing even though I extended the memory in Docker to 4GB.

This very blog (www.peterbe.com) has a similar stack. Redis, PostgreSQL, Elasticsearch all running in Docker. It works great. One single docker-compose up web starts everything I need. But when it comes to much larger databases, I found my macOS host to be much more performant.

So the dark side of this is that I have to remember to do more things when starting work on this project. My PostgreSQL was installed with Homebrew and is always running on my laptop. For Elasticsearch I have to open a dedicated terminal and go to a specific location to start the Elasticsearch for this project (e.g. make start-elasticsearch).

The way I do this is that I have this in my Django project's settings.py:

import dj_database_url
from decouple import config, Csv


DATABASES = {
    'default': config(
        'DATABASE_URL',
        # Hostname 'docker.for.mac.host.internal' assumes
        # you have at least Docker 17.12.
        # For older versions of Docker use 'docker.for.mac.localhost'
        default='postgresql://peterbe@docker.for.mac.host.internal/songsearch',
        cast=dj_database_url.parse
    )
}

ES_HOSTS = config('ES_HOSTS', default='docker.for.mac.host.internal:9200', cast=Csv())

(Actually, in reality the defaults in the settings.py code are localhost and I use docker-compose.yml environment variables to override this, but the point is hopefully still there.)

And that's basically it. Now I get Docker to do what various virtualenvs and terminal scripts used to do, but with the performance of running the big databases on the host.

Really simple Django view function timer decorator

08 December 2017 2 comments   Python, Django


I use this sometimes to get insight into how long some view functions take. Perhaps you find it useful too:


import functools
import time


def view_function_timer(prefix='', writeto=print):

    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            try:
                t0 = time.time()
                return func(*args, **kwargs)
            finally:
                t1 = time.time()
                writeto(
                    'View Function',
                    '({})'.format(prefix) if prefix else '',
                    func.__name__,
                    args[1:],
                    'Took',
                    '{:.2f}ms'.format(1000 * (t1 - t0)),
                    args[0].build_absolute_uri(),
                )
        return inner

    return decorator

And to use it:

from wherever import view_function_timer


@view_function_timer()
def homepage(request, thing):
    ...
    return render(request, template, context)

And then it prints something like this:

View Function  homepage ('valueofthing',) Took 23.22ms http://localhost:8000/home/valueofthing

It's useful when you don't want a full-blown solution to measure all view functions with a middleware or something.
It can also be useful to see how a cache decorator works:

from django.views.decorators.cache import cache_page
from wherever import view_function_timer


@view_function_timer('possibly cached')
@cache_page(60 * 60 * 2)  # two hours cache
@view_function_timer('not cached')
def homepage(request, thing):
    ...
    return render(request, template, context)

That way you can trace that, with tail -f or something, to see how/if the caching decorator works.

There are many better solutions that are more robust but might be a bigger investment. For example, I would recommend markus which, if you don't have a statsd server, you can configure to log the timings with logger.info.
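
For example, configuring markus to simply log the timings looks something like this (from memory, so double-check against the markus docs):

import markus

markus.configure(
    backends=[
        # Emits the metrics as logging calls instead of statsd
        {"class": "markus.backends.logging.LoggingMetrics"},
    ]
)

metrics = markus.get_metrics(__name__)

# Then, instead of print()-ing, a view timer could record:
# metrics.timing("view_function", 1000 * (t1 - t0))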