Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Python. Clear filter

Best EXPLAIN ANALYZE benchmark script

19 April 2018 0 comments   PostgreSQL, Python

https://gist.github.com/peterbe/966effb3f357258ddda5aa8ac385b418


tl;dr; Use best-explain-analyze.py to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefix it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really want to just look at the Execution time milliseconds number. Especially if you might have to slightly different SQL queries to compare and contrast.

However, as you might have noticed, the number on the Execution time varies between runs. You might think nothing's changed but Postgres might have warmed up some internal caches or your host might be more busy or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

best-explain-analyze.py

Check this out: best-explain-analyze.py

Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!

Now, when you run it it runs that thing 10 times (by default) and reports the best Execution time, its mean and its median. Example output:

▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Efficient many-to-many field lookup in Django REST Framework

11 April 2018 0 comments   PostgreSQL, Django, Python


The basic setup

Suppose you have these models:

from django.db import models


class Category(models.Model):
    name = models.CharField(max_length=100)


class Blogpost(models.Model):
    title = models.CharField(max_length=100)
    categories = models.ManyToManyField(Category)

Suppose you hook these up Django REST Framework and list all Blogpost items. Something like this:

# urls.py
from rest_framework import routers
from . import views


router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

What's the problem?

Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/) what will happen, on the database, is something like this:

SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";

SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...

Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogpost, for each and every one, it needs to make a query to the many-to-many table (app_blogpost_categories in this example).

That's not going to be performant. In fact, it might be dangerous on your database if the query of blogposts gets big, like requesting a 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap'ish but doing 1,000 selects with JOIN is never going to be cheap. It adds up horribly.

How you solve it

The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost table and 1 query on the app_category table.

First you have to override the ViewSet.list method. Then, in there you can do exactly what you need.

Here's the framework for this change:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!

        return response

Next, we need to make a mapping of all Category.id -1-> Category.name. But we want to make sure we do only on the categories that are involved in the Blogpost records that matter. You could do something like this:

category_names = {}
for category in Category.objects.all():
    category_names[category.id] = category.name

But to avoid doing a lookup of category names for those you never need, use the query set on Blogpost. I.e.

qs = self.get_queryset()
all_categories = Category.objects.filter(
    id__in=Blogpost.categories.through.objects.filter(
        blogpost__in=qs
    ).values('category_id')
)
category_names = {}
for category in all_categories:
    category_names[category.id] = category.name

Now you have a dictionary of all the Category IDs that matter.

Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant were you look them up "simplestly" without any fancy joins.

Next, is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings). Like this:

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
    categories_map[m2m.blogpost_id].append(
        category_names[m2m.category_id]
    )

So what we have now is a dictionary whose keys are the IDs in self.get_queryset() and each value is a list of a strings. E.g. ['Category X', 'Category Z'] etc.

Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:

for each in response.data:
    each['categories'] = categories_map.get(each['id'], [])

The whole solution looks something like this:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        qs = self.get_queryset()
        all_categories = Category.objects.filter(
            id__in=Blogpost.categories.through.objects.filter(
                blogpost__in=qs
            ).values('category_id')
        )
        category_names = {}
        for category in all_categories:
            category_names[category.id] = category.name

        categories_map = defaultdict(list)
        for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
            categories_map[m2m.blogpost_id].append(
                category_names[m2m.category_id]
            )

        for each in response.data:
            each['categories'] = categories_map.get(each['id'], [])

        return response

It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c) is better than O(n).

Discussion

Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.

Or, if you use pagination very conservatively, and only allow like 10 items per page then it won't be so expensive to do one query per every many-to-many field.

hashin 0.12.0 is much much faster

20 March 2018 0 comments   Python


tl;dr; The new 0.12.0 version of hashin is between 6 and 30 times faster.

Version 0.12.0 is exciting because it switches from using https://pypi.python.org/pypi/<package>/json to https://pypi.org/pypi/<package>/json so it's using the new PyPI. I only last week found out about the JSON containing .digest.sha256 as part of the JSON even though apparently it's been there for almost a year!

Prior to version 0.12.0, what hashin used to do is download every tarball and .whl file and run pip on it, in Python, to get the checksum hash. Now, if you use the default sha256, that checksum value is immediately available right there in the JSON, for every file per release. This is especially important for binary packages (lxml for example) where it has to download a lot.

To test this, I cleared my temp directory of any previously downloaded Django-* and lxml-* files and used hashin 0.11.5 to fill a requirements.txt for Django and lxml:

▶ hashin --version
0.11.5
▶ time hashin Django
hashin Django  0.48s user 0.14s system 12% cpu 5.123 total
▶ time hashin lxml
hashin lxml  1.61s user 0.59s system 8% cpu 25.361 total

In other words, 5.1 seconds to get the hashes for Django and 25.4 seconds for lxml.
Now, let's do it with the new 0.12.0

▶ hashin --version
0.12.0
▶ mv requirements.txt 0.11.5-requirements.txt ; touch requirements.txt
▶ time hashin Django
hashin Django  0.34s user 0.06s system 46% cpu 0.860 total
▶ time hashin lxml
hashin lxml  0.35s user 0.06s system 44% cpu 0.909 total

So, instead of 5.1 seconds, now it only takes 0.9 seconds. And instead of 25.4 seconds, now it only takes 0.9 seconds.

Yay!

Note, the old code that downloads and runs pip is still there. It kicks in when you request a digest checksum that is not included in the JSON. For example...:

▶ hashin --version
0.12.0
▶ time hashin --algorithm sha512 lxml
hashin --algorithm sha512 lxml  1.56s user 0.64s system 5% cpu 38.171 total

(The reason this took 38 seconds instead of 25 in the run above is because of the unpredictability of the speed of my home broadband)

Worlds simplest web scraper bot in Python

16 March 2018 1 comment   Python


I just needed a little script to click around a bunch of pages synchronously. It just needed to load the URLs. Not actually do much with the content. Here's what I hacked up:

import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin


session = requests.Session()
urls = []


def run(url):
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')

If you want to do this when the user is signed in, go to the site in your browser, open the Network tab on your Web Console and copy the value of the Cookie request header.
Change that session.get(url) to something like:

html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).decode('utf-8')

Now it can spider bot around on your site for a little while as if you're logged in.

Dirty. Simple. Fast.

Now using minimalcss

12 March 2018 0 comments   Node, Javascript, Web development, Python


tl;dr; minimalcss is much better than mincss to slew out the minimal CSS your page needs to render. More accurate and more powerful features. This site now uses minimalcss in inline the minimum CSS needed to render the page.

I started minimalcss back in August 2017 and its goal was ultimately to replace mincss.

The major difference between minimalcss and mincss isn't that one is Node and one is Python, but that minimalcss is based on a full headless browser to handle all the CSS downloading and the proper rendering of the DOM. The other major difference is that mincss was based in regular expressions to analyze the CSS and minimalcss is based on proper abstract syntax tree ("AST") implemented by csso.

Because minimalcss is AST based, it can do a lot more. Smarter. For example, it's able to analyze the CSS to correctly and confidently figure out if any/which keyframe animations and font-face at-rules are actually needed.
Also, because minimalcss is based on csso, when it minifies the CSS it's able to restructure the CSS in a safe and smart way. I.e. p { color: blue; } h2 { color: blue; } becomes p,h2{color:blue}.

So, now I use minimalcss here on this blog. The pages are rendered in Django and a piece of middleware sniffs all outgoing HTML responses and depending on the right conditions it dumps the HTML as a file on disk as path/in/url/index.html. Then, that newly created file is sent to a background worker in Celery which starts post-processing it. Every index.html file is accompanied with the full absolute URL that it belongs to and that's the URL that gets sent to minimalcss which returns the absolute minimal CSS the page needs to load and lastly, a piece of Python script basically does something like this:

From...

<!-- before -->
<link rel="stylesheet" href="/file.css"/>

To...

<!-- after -->
<noscript><link rel="stylesheet" href="/file.css"/></noscript>
<style> ... /* minimal CSS selectors for rendering from /file.css */ ... </style>

There is also a new JavaScript dependency which is the cssrelpreload.js from the loadCSS project. So all the full (original) CSS is still downloaded and inserted into the CSSOM but it happens much later which ultimately means the page can be rendered and useful much sooner than if we'd have to wait to download and parse all of the .css URLs.

I can go into more details if there's interest and others want to do this too. Because this site is all Python and minimalcss is all Node, the integration is done over HTTP on localhost with minimalcss-server.

The results

Unfortunately, this change was mixed in with other smaller optimizations that makes the comparison unfair. (Hey! my personal blog is just a side-project after all). But I downloaded a file before and after the upgrade and compared:

▶ ls -lh *.html
-rw-r--r--  1 peterbe  wheel    19K Mar  7 13:22 after.html
-rw-r--r--  1 peterbe  wheel    96K Mar  7 13:21 before.html

If I extract out the inline style block from both pages and compare it looks like this:
https://gist.github.com/peterbe/fc2fdddd5721fb35a99dc1a50c2b5311

So, downloading the initial HTML document is now 19KB instead of previous 96KB. And visually there's absolutely no difference.

Granted, in the after.html version, a piece of JavaScript kicks in and downloads /static/css/base.min.91f6fc577a60.css and /static/css/base-dynamic.min.e335b9bfa0b1.css from the CDN. So you have to download these too:

▶ ls -lh *.css.gz
-rw-r--r--  1 peterbe  wheel   5.0K Mar  7 10:24 base-dynamic.min.e335b9bfa0b1.css.gz
-rw-r--r--  1 peterbe  wheel    95K Mar  7 10:24 base.min.91f6fc577a60.css.gz

The reason the difference appears to be huge is because I changed a couple of other things around the same time. Sorry. For example, certain DOM nodes were rendered as HTML but made hidden until some jQuery script made it not hidden anymore. For example, the "dimmer" effect over a comment textarea after you hit the submit button. Now, I've changed the jQuery code to build up the DOM when it needs it rather than relying on it being there (hidden). This means that certain base64 embedded font-faces are no longer needed in the minimal CSS payload.

Why this approach is better

So the old approach was to run mincss on the HTML and inject that as an inline style block and throw away the original (relevant) <link rel="stylesheet" href="..."> tags.
That had the annoying drawback that there was CSS in the stylesheets that I knew was going to be needed by some XHR or JavaScript later. For example, if you post a comment some jQuery code changes the DOM and that new DOM needs these CSS selectors later. So I had to do things like this:

.project a.perm { /* no mincss */
    font-size: 0.7em;
    padding-left: 8px;
}
.project a.perm:link { /* no mincss */
    color: rgb(151,151,151);
}
.project a.perm:hover { /* no mincss */
    color: rgb(51,51,51);
}

This was to inform mincss to leave those untouched even though no DOM node uses them right now. With minimalcss this is no longer needed.

What's next?

Keep working on minimalcss and make it even better.

Also, the scripting I used to modify the HTML file is a hack and should probably be put into the minimalcss project.

Last but not least, every time I put in some effort to web performance optimize my blog pages my Google ranking goes up and I usually see an increase in Google referrals in my Google Analytics because it's pretty obvious that Google loves fast sites. So I'm optimistically waiting for that effect.

csso and django-pipeline

28 February 2018 0 comments   Javascript, Django, Python


This is a quick-and-dirty how-to on how to use csso to handle the minification/compression of CSS in django-pipeline.

First create a file called compressors.py somewhere in your project. Make it something like this:

import subprocess
from pipeline.compressors import CompressorBase
from django.conf import settings


class CSSOCompressor(CompressorBase):

    def compress_css(self, css):
        proc = subprocess.Popen(
            [
                settings.PIPELINE['CSSO_BINARY'], 
                '--restructure-off'
            ],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        css_out = proc.communicate(
            input=css.encode('utf-8')
        )[0].decode('utf-8')
        # was_size = len(css)
        # new_size = len(css_out)
        # print('FROM {} to {} Saved {}  ({!r})'.format(
        #     was_size,
        #     new_size,
        #     was_size - new_size,
        #     css_out[:50]
        # ))
        return css_out

In your settings.py where you configure django-pipeline make it something like this:

PIPELINE = {
    'STYLESHEETS': PIPELINE_CSS,
    'JAVASCRIPT': PIPELINE_JS,

    # These two important lines. 
    'CSSO_BINARY': path('node_modules/.bin/csso'),
    # Adjust the dotted path name to where you put your compressors.py
    'CSS_COMPRESSOR': 'peterbecom.compressors.CSSOCompressor',

    'JS_COMPRESSOR': ...

Next, install csso-cli in your project root (where you have the package.json). It's a bit confusing. The main package is called csso but to have a command line app you need to install csso-cli and when that's been installed you'll have a command line app called csso.

$ yarn add csso-cli

or

$ npm i --save csso-cli

Check that it installed:

$ ./node_modules/.bin/csso --version
3.5.0

And that's it!

--restructure-off

So csso has an advanced feature to restructure the CSS and not just remove whitespace and not needed semicolons. It costs a bit of time to do that so if you want to squeeze the extra milliseconds out, enable it. Trading time for space.
See this benchmark for a comparison with and without --restructure-off in csso.

Why csso you might ask

Check out the latest result from css-minification-benchmark. It's not super easy to read by it seems the best performing one in terms of space (bytes) is crass written by my friend and former colleague @mattbasta. However, by far the fastest is csso when using --restructre-off. Minifiying font-awesome.css with crass takes 326.52 ms versus 3.84 ms in csso.

But what's great about csso is Roman @lahmatiy Dvornov. I call him a friend too for all the help and work he's done on minimalcss (not a CSS minification tool by the way). Roman really understands CSS and csso is actively maintained by him and other smart people who actually get into the scary weeds of CSS browser hacks. That gives me more confidence to recommend csso. Also, squeezing a couple bytes extra out of your .min.css files isn't important when gzip comes into play. It's better that the minification tool is solid and stable.

Check out Roman's slides which, even if you don't read it all, goes to show that CSS minification is so much more than just regex replacing whitespace.
Also crass admits as one of its disadvantages: "Certain "CSS hacks" that use invalid syntax are unsupported".

Fastest way to unzip a zip file in Python

31 January 2018 10 comments   Python


So the context is this; a zip file is uploaded into a web service and Python then needs extract that and analyze and deal with each file within. In this particular application what it does is that it looks at the file's individual name and size, compares that to what has already been uploaded in AWS S3 and if the file is believed to be different or new, it gets uploaded to AWS S3.

Uploads today
The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them, there are mostly plain text files but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files and 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and deal with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. I guess it makes sense. First you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So, the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files. This worked much better but I still noticed the whole unzipping was taking up a huge amount of time. Is there perhaps a way to optimize that?

Baseline function

First it's these common functions that simulate actually doing something with the files in the zip file:

def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)

def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory 
    # inefficient and unrealistic in terms of this 
    # benchmark experiment. 
    total = 0
    for line in f:
        total += len(line)
    return total

Here's the simplest one possible:

def f1(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            fn = os.path.join(root, file_)
            total += _count_file(fn)
    return total

If I analyze it a bit more carefully, I find that it spends about 40% doing the extractall and 60% doing the looping over files and reading their full length.

First attempt

My first attempt was to try to use threads. You create an instance of zipfile.ZipFile, extract every file name within and start a thread for each name. Each thread is given a function that does the "meat of the work" (in this benchmark, iterating over the file and getting its total size). In reality that function does a bunch of complicated S3, Redis and PostgreSQL stuff but in my benchmark I just made it a function that figures out the total length of file. The thread pool function:

def f2(fn, dest):

    def unzip_member(zf, member, dest):
        zf.extract(member, dest)
        fn = os.path.join(dest, member.filename)
        return _count_file(fn)

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member,
                        zf,
                        member,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~10% faster

Second attempt

So perhaps the GIL is blocking me. The natural inclination is to try to use multiprocessing to spread the work across multiple available CPUs. But doing so has the disadvantage that you can't pass around a non-pickleable object so you have to send just the filename to each future function:

def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)



def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~300% faster

That's cheating!

The problem with using a pool of processors is that it requires that the original .zip file exists on disk. So in my web server, to use this solution, I'd first have to save the in-memory ZIP file to disk, then invoke this function. Not sure what the cost of that it's not likely to be cheap.

Well, it doesn't hurt to poke around. Perhaps, it could be worth it if the extraction was significantly faster.

But remember! This optimization depends on using up as many CPUs as it possibly can. What if some of those other CPUs are needed for something else going on in gunicorn? Those other processes would have to patiently wait till there's a CPU available. Since there's other things going on in this server, I'm not sure I'm willing to let on process take over all the other CPUs.

Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between f1 and f2! With concurrent.futures pool classes you can cap the number of CPUs it's allowed to use but that doesn't feel great either. What if you get the number wrong in a virtual environment? Of if the number is too low and don't benefit any from spreading the workload and now you're just paying for overheads to move the work around?

I'm going to stick with zipfile.ZipFile(file_buffer).extractall(temp_dir). It's good enough for this.

Want to try your hands on it?

I did my benchmarking using a c5.4xlarge EC2 server. The files can be downloaded from:

wget https://www.peterbe.com/unzip-in-parallel/hack.unzip-in-parallel.py
wget https://www.peterbe.com/unzip-in-parallel/symbols-2017-11-27T14_15_30.zip

The .zip file there is 34MB which is relatively small compared to what's happening on the server.

The hack.unzip-in-parallel.py is a hot mess. It contains a bunch of terrible hacks and ugly stuff but hopefully it's a start.

Conditional aggregation in Django 2.0

12 January 2018 4 comments   PostgreSQL, Django, Python


Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation" which is SQL standard I didn't even know about.

Before

So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various different time intervals.

First of all, this is how it set up the time intervals:

today = timezone.now()
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)

And then, for each of these, there's a little function that returns a dict for each time interval:

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
}

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

What you get is exactly 2 x 4 = 8 queries. One COUNT and one SUM for each time interval. E.g.

SELECT SUM("upload_upload"."size") AS "size" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

SELECT COUNT(*) AS "__count" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

...6 more queries...

Middle

Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query.

# New, improved count_and_size() function!
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

Much better, now there's only one query per time bucket. So 4 queries in total. E.g.

SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" 
FROM "upload_upload" 
WHERE ("upload_upload"."created_at" >= ...

...3 more queries...

After

But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there's so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:

def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)

q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),

    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),

    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),

    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)
numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}

Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:

SELECT 
  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", 
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size",

  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", 
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", 

  ...

FROM "upload_upload";

Is this the best thing to do? I'm starting to have my doubts.

Watch Out!

When I take this now 1 monster query for a spin with an EXPLAIN ANALYZE prefix I notice something worrying!

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)

A sequential scan! That's terrible. The created_at column is indexed in a BTREE so why can't it use the index.

The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked.

So what did I do? I went back to the "middle" solution. One SELECT query per time bucket. So 4 queries in total, but at least all 4 is able to use an index.

Understanding Redis hash-max-ziplist-entries

08 January 2018 0 comments   Redis, Python


This is an advanced topic for people who do serious stuff in Redis. I need to do serious stuff in Redis so I'm trying to learn about the best way to store lots of keys with hash maps.

It seems that this article by Salvatore Sanfilippo (creator of Redis) himself seems to be a much cited article for this topic. If you haven't read it, the gist is that Redis can employ some clever optimizations for storing hash maps in a very memory efficient way instead of storing each key-value separately.

"Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory (with 5 time less memory used being the average saving)"

This efficient storage optimization is called a ziplist.

How does that work?

Suppose you have 1,000 keys (with their values) if you store them like this:

SET keyprefix0001 valueX
SET keyprefix0002 valueY
SET keyprefix0003 valueN
...
INFO

then, the total memory footprint of the Redis DB is linear. The total memory footprint is roughly that of the length of the keys and values.
Alternatively, if you use hash maps (essentially storing dictionaries instead of individual values) you can benefit from some of those "tricks". To do that, you need to "bucketize" or "shard" the keys. Basically, some way to lump together many keys under one "parent key". (Note this is only applicable if you have huge number of keys)

In this article for one of the founders of Instagram, Mike Krieger, proposes a technique of dividing his keys (since they're all numbers) by 1000 and use the "remainder" as the key inside the hash map. So an original key like "1155315"'s prefix becomes "1155" (1155315 / 1000 = 1155) so he stores:

HSET "1155" "1155315" THEVALUE

Since the numbers are all integers the first 4 characters means there are 10,000 different combinations. I.e. a total 10,000 hash maps.

So when you need to look up the value of key "1155315" you apply the same logic you did when you inserted it.

HGET "1155" "1155315"

So this is where hash-max-ziplist-entries comes in. It's a configuration setting. In my Redis 3.2 install (both in an official Redis Docker image and on my Homebrew install) this number is 512. So what does that mean?

I believe, it means the maximum number of keys per hash map to benefit from the internal space optimization storage.

Real experiment

To demonstrate my understanding to myself, I wrote a hacky script that stores 1,000,000 keys with different "bucketization algorithms" (yeah, I made that term up). Depending how you run the script, it splits the keys (which are all integers) into different number of buckets. For example, if you integer divide each key by 500 you get 2,000 different hash map buckets, aka. total number of keys in the database.

So I run that script with a bunch of different sizes, flush the database between each run and at the end of every run I log how big the whole database is in number of megabytes. Draw a graph of this and this is what you get:

Redis keys per hash map

What's happening here is that since my hash-max-ziplist-entries is 512, around there, Redis does not want to store each hash map in space efficient way and stores it has plain key values.

In other words, when each hash map has more than 502 keys, memory usage shoots up.

Sanity check

It's no surprise that the more keys (and values) you store, the bigger the total size of the database. Here's a graph I used when inserting 10,000 integer keys, 20,000 integer keys etc.

The three lines represent 3 different batches of experiments. SET means it inserts N keys with the simple SET operator. HSET (1,000) inserts N keys with each key bucket is 1,000 keys. And lastly, HSET (500) means each hash map bucket has at most 500 keys.

SET vs. HSET vs HSET (smart)

I guess you could say they're all linear but note that they all store the exact same total amount of information.

What's the catch?

I honestly don't know but one thing is clear; there's no such thing as a free lunch.

In the memory-optimization article by Salvatore he alludes that "... a clear tradeoff between many things: CPU, memory, max element size.". So perhaps this neat trick that saves so much memory can comes at a cost of more CPU resource work. At the moment I don't know of a good way to measure this. So far my focus has been on the optimal way of storing things.

Also, I can only speculate but common sense smells like if that number (hash-max-ziplist-entries) is very large, Redis might allocate more memory for the tree it might store than it actually stores. Meaning if I set it to, say, 10,000 but most of my hash maps only have around 1,000 keys. Does that mean that the total memory usage goes up with empty memory allocations?

What to do

Why it's set to 512 by default in Redis 3.2 I don't know, but I know you can change it!

redis-store:6379> config get hash-*
1) "hash-max-ziplist-entries"
2) "512"
3) "hash-max-ziplist-value"
4) "64"
redis-store:6379> config set hash-max-ziplist-entries 1024
OK

AWS ElastiCache parameter groups

When I make that change and run the original script again but using 1,000 keys per bucket, the whole thing only uses 12MB. So it works.

I haven't actually tested this in AWS ElastiCache but I did try creating a new ElastiCache Parameter Group and there's an input for it.

According to this page it seems it's possible to change it with "Azure Redis Cache" too.

Lastly, I wish there was a way to run Redis where it spits out in a log file somewhere whenever your trip certain thresholds.

Whatsdeployed facelift

05 January 2018 0 comments   Docker, Mozilla, Web development, Python

https://whatsdeployed.io/


tl;dr; Whatsdeployed.io is an impressively simple web app to help web developers and web ops people quickly see what GitHub commits have made it into your Dev, Stage or Prod environment. Today it got a facelift.

The code is now more than 5 years old and has served me well. It's weird to talk too positively about the app because I actually wrote it but because it's so simple in terms of design and effort it feels less personal to talk about it.

Here's what's in the facelift

Please let me know if there's anything broken or missing.