Peterbe.com

A blog and website by Peter Bengtsson


GeneratorExit - How to clean up after the last yield in Python

07 June 2018 2 comments   Python


tl;dr; Use except GeneratorExit if your Python generator needs to know the consumer broke out.

Suppose you have a generator that yields things out. After each yield you want to execute some code that does something like logging or cleaning up. Here's one such trivialized example:

The Problem

def pump():
    numbers = [1, 2, 3, 4]
    for number in numbers:
        yield number
        print("Have sent", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)

print("All done")

The output is, as expected:

Got 1
Have sent 1
Got 2
Have sent 2
Got 3
Have sent 3
Got 4
Have sent 4
Last number was sent
All done

In this scenario, the consumer of the generator (the for number in pump() loop in this example) gets everything the generator generates, so after the last yield the generator is free to do any last-minute activities which might be important, such as closing a socket or updating a database.

Suppose the consumer is getting a bit "impatient" and breaks out as soon as it has what it needs.

def pump():
    numbers = [1, 2, 3, 4]
    for number in numbers:
        yield number
        print("Have sent", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)
    # THESE TWO NEW LINES
    if number >= 2:
        break

print("All done")

What do you think the output is now? I'll tell you:

Got 1
Have sent 1
Got 2
All done

In other words, the potentially important lines print("Have sent", number) and print("Last number was sent") never get executed! The generator's author could tell the consumer (through documentation) "Don't break! If you don't want any more, raise a StopIteration instead". But that's not a feasible requirement.

The Solution

But! There is a better solution and that's to catch GeneratorExit exceptions.

def pump():
    numbers = [1, 2, 3, 4]
    try:
        for number in numbers:
            yield number
            print("Have sent", number)
    except GeneratorExit:
        print("Exception!")
    print("Last number was sent")

for number in pump():
    print("Got", number)
    if number == 2:
        break
print("All done")

Now you get what you might want:

Got 1
Have sent 1
Got 2
Exception!
Last number was sent
All done

Next Level Stuff

Note that in the last example's output, it never prints Have sent 2 even though the generator really did send that number. Suppose that's an important piece of information; then you can still reach it inside the except GeneratorExit block. Like this, for example:

def pump():
    numbers = [1, 2, 3, 4]
    try:
        for number in numbers:
            yield number
            print("Have sent", number)
    except GeneratorExit:
        print("Have sent*", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)
    if number == 2:
        break
print("All done")

And the output is:

Got 1
Have sent 1
Got 2
Have sent* 2
Last number was sent
All done

The * is just there in case we want to distinguish between a break happening or not. It depends on your application.
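
To make this a bit more realistic than print statements, here's a minimal sketch of the same pattern for a generator that streams rows out of a SQLite database and must close its connection whether the consumer finishes, breaks out, or gets garbage collected. The database path and query here are just whatever the caller passes in:

import sqlite3


def stream_rows(db_path, query):
    connection = sqlite3.connect(db_path)
    try:
        for row in connection.execute(query):
            yield row
    except GeneratorExit:
        # The consumer stopped early (break, .close(), or garbage collection).
        # Don't yield anything here; just fall through to the cleanup.
        print("Consumer stopped early")
    finally:
        connection.close()
        print("Connection closed")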

Writing a custom Datadog reporter using the Python API

21 May 2018 2 comments   Python

https://github.com/peterbe/buildhub-datadoggy


Datadog is an awesome software-as-a-service where you can aggregate and visualize statsd metrics sent from an application. For visualizing timings you create a time series graph. It can look something like this:

Time series

This time series looks sane because the timings are made very, very frequently. But what if the thing being measured happens very rarely, like once a day? Then the graph doesn't look very useful. See this example:

"Rare time" series

Not only does it happen rarely, the raw number of milliseconds is really quite hard to parse. I.e. what's 2.6 million milliseconds? (Answer: approximately 45 minutes.) So to solve that I used the Datadog API. Now I can get every single point of the metric, in milliseconds, and make a little data table with human-readable dates and times.

The end result looks something like this:

SCOPE: ENV:PROD
+-------------------------+------------------------+-----------------------+
|          WHEN           |        TIME AGO        |       TIME TOOK       |
+=========================+========================+=======================+
| Mon 2018-05-21T17:00:00 | 2 hours 43 minutes ago | 23 minutes 32 seconds |
+-------------------------+------------------------+-----------------------+
| Sun 2018-05-20T17:00:00 | 1 day 2 hours ago      | 20 seconds            |
+-------------------------+------------------------+-----------------------+
| Sat 2018-05-19T17:00:00 | 2 days 2 hours ago     | 20 seconds            |
+-------------------------+------------------------+-----------------------+
| Fri 2018-05-18T17:00:00 | 3 days 2 hours ago     | 2 minutes 24 seconds  |
+-------------------------+------------------------+-----------------------+
| Wed 2018-05-16T20:00:00 | 4 days 23 hours ago    | 38 minutes 38 seconds |
+-------------------------+------------------------+-----------------------+

It's not gorgeous and there are a lot of caveats but it's at least really easy to read. See the metrics.py code here.

I don't think you can run this code since you don't have the same (hardcoded) metrics but hopefully it can serve as an example to whet your appetite.
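
Roughly, the approach with the datadog Python client looks like this. This is a sketch only: the metric name, the scope and the API/app keys are placeholders, not what the real metrics.py uses, and it skips the "time ago" column:

import datetime
import time

from datadog import initialize, api

initialize(api_key='<API KEY>', app_key='<APP KEY>')

now = int(time.time())
result = api.Metric.query(
    start=now - 7 * 24 * 60 * 60,  # the last 7 days
    end=now,
    query='avg:myapp.task.duration{env:prod}',  # hypothetical metric name and scope
)

for series in result.get('series', []):
    for timestamp_ms, value_ms in series['pointlist']:
        if value_ms is None:
            continue
        when = datetime.datetime.fromtimestamp(timestamp_ms / 1000)
        seconds = value_ms / 1000
        print(
            when.strftime('%a %Y-%m-%dT%H:%M:%S'),
            '{:.0f} minutes {:.0f} seconds'.format(seconds // 60, seconds % 60),
        )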

What I'm going to do next, if I have time, is to run this as a Flask app instead that outputs an HTML table on a Heroku app or something.

Rust > Go > Python ...to parse millions of dates in CSV files

15 May 2018 8 comments   Python


It all started so innocently. The task at hand was to download an inventory of every single file ever uploaded to a public AWS S3 bucket. The way that works is that you download the root manifest.json. It references a boatload of .csv.gz files. So to go through every single file uploaded to the bucket, you read the manifest.json, then download each and every .csv.gz file. Now you can parse these and do something with each row. An example row in one of the CSV files looks like this:

"net-mozaws-prod-delivery-firefox","pub/firefox/nightly/latest-mozilla-central-l10n/linux-i686/xpi/firefox-57.0a1.gn.langpack.xpi","474348","2017-09-21T13:08:25.000Z","0af6ce0b494d1f380a8b1cd6f42e36cb"

In the Mozilla Buildhub, what we do is periodically run this, in Python (with asyncio), to spot if there are any files in the S3 bucket that we have missed recording in a different database.
But out of the 150 or so .csv.gz files, most are getting old and in this particular application we can be certain they're unlikely to be relevant and can be ignored. To come to that conclusion you parse each .csv.gz file, parse each row of the CSV, and extract the last_modified value (e.g. 2017-09-21T13:08:25.000Z) into a datetime.datetime instance. Now you can quickly decide if it's too old or recent enough to go through the other various checks.

So, the task is to parse 150 .csv.gz files totalling about 2.5GB with roughly 75 million rows. Basically parsing the date strings into datetime.datetime objects 75 million times.

Python

What this script does is it opens, synchronously, each and every .csv.gz file, parses each record's date and compares it to a constant ("Is this record older than 6 months or not?").

This is an extraction of a bigger system to just look at the performance of parsing all those .csv.gz files to figure out how many are old and how many are within 6 months. Code looks like this:

import datetime
import gzip
import csv
from glob import glob

cutoff = datetime.datetime.now() - datetime.timedelta(days=6 * 30)

def count(fn):
    count = total = 0
    with gzip.open(fn, 'rt') as f:
        reader = csv.reader(f)
        for line in reader:
            lastmodified = datetime.datetime.strptime(
                line[3],
                '%Y-%m-%dT%H:%M:%S.%fZ'
            )
            if lastmodified > cutoff:
                count += 1
            total += 1

    return total, count


def run():
    total = recent = 0
    for fn in glob('*.csv.gz'):
        if len(fn) == 39:  # filter out other junk files that don't fit the pattern
            print(fn)
            t, c = count(fn)
            total += t
            recent += c

    print(total)
    print(recent)
    print('{:.1f}%'.format(100 * recent / total))


run()

Code as a gist here.

Only problem. This is horribly slow.

To reproduce this, I took a sample of 38 of these .csv.gz files and ran the above code with CPython 3.6.5. It took 3m44s on my 2017 MacBook Pro.

Let's try a couple of low-hanging fruit ideas:

  1. Using PyPy (for Python 3.5): 2m30s
  2. Using asyncio: 3m37s
  3. Using concurrent.futures.ThreadPoolExecutor: 7m05s
  4. Using concurrent.futures.ProcessPoolExecutor: 1m5s

Hmm... Clearly this is CPU bound and using multiple processes is the ticket. But what's really holding us back is the date parsing. From the "Fastest Python datetime parser" benchmark the trick is to use ciso8601. Alright, let's try that. Diff:

6c6,10
< cutoff = datetime.datetime.now() - datetime.timedelta(days=6 * 30)
---
> import ciso8601
>
> cutoff = datetime.datetime.utcnow().replace(
>     tzinfo=datetime.timezone.utc
> ) - datetime.timedelta(days=6 * 30)
14,17c18
<             lastmodified = datetime.datetime.strptime(
<                 line[3],
<                 '%Y-%m-%dT%H:%M:%S.%fZ'
<             )
---
>             lastmodified = ciso8601.parse_datetime(line[3])

Version with ciso8601 here.

So what originally took 3 and a half minutes now takes 18 seconds (that's ciso8601 combined with a process pool; see the conclusion below). I suspect that's about as good as it gets with Python.
But it's not too shabby. Parsing 12,980,990 date strings in 18 seconds. Not bad.
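
For reference, the process-pool variant mentioned in the conclusion is roughly this shape; a sketch that reuses the count() function from the script above (with ciso8601 doing the parsing):

import concurrent.futures
from glob import glob


def run():
    total = recent = 0
    # Same filename filtering as before; each file is counted in its own process.
    filenames = [fn for fn in glob('*.csv.gz') if len(fn) == 39]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for t, c in executor.map(count, filenames):
            total += t
            recent += c
    print(total)
    print(recent)
    print('{:.1f}%'.format(100 * recent / total))


if __name__ == '__main__':
    run()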

Go

My Go is getting rusty but it's quite easy to write one of these so I couldn't resist the temptation:

package main

import (
    "compress/gzip"
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strconv"
    "time"
)

func count(fn string, index int) (int, int) {
    fmt.Printf("%d %v\n", index, fn)
    f, err := os.Open(fn)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    gr, err := gzip.NewReader(f)
    if err != nil {
        log.Fatal(err)
    }
    defer gr.Close()

    cr := csv.NewReader(gr)
    rec, err := cr.ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    var count = 0
    var total = 0
    layout := "2006-01-02T15:04:05.000Z"

    minimum, err := time.Parse(layout, "2017-11-02T00:00:00.000Z")
    if err != nil {
        log.Fatal(err)
    }

    for _, v := range rec {
        last_modified := v[3]

        t, err := time.Parse(layout, last_modified)
        if err != nil {
            log.Fatal(err)
        }
        if t.After(minimum) {
            count += 1
        }
        total += 1
    }
    return total, count
}

func FloatToString(input_num float64) string {
    // to convert a float number to a string
    return strconv.FormatFloat(input_num, 'f', 2, 64)
}

func main() {
    var pattern = "*.csv.gz"

    files, err := filepath.Glob(pattern)
    if err != nil {
        panic(err)
    }
    total := int(0)
    recent := int(0)
    for i, fn := range files {
        if len(fn) == 39 {
            // fmt.Println(fn)
            t, c := count(fn, i)
            total += t
            recent += c
        }
    }
    fmt.Println(total)
    fmt.Println(recent)
    ratio := float64(recent) / float64(total)
    fmt.Println(FloatToString(100.0 * ratio))
}

Code as as gist here.

Using go1.10.1 I run go build main.go and then time ./main. This takes just 20s, which is about the time it took the Python version that uses a process pool and ciso8601.

I showed this to my colleague @mostlygeek, who saw my scripts and did the Go version more properly in its own repo.
At first pass (go build filter.go and time ./filter) this one clocks in at 19s, just like my naive initial hack. However, if you run it as time GOPAR=2 ./filter it will use 8 workers (my MacBook Pro has 8 CPUs) and now it only takes: 5.3s.

By the way, check out @mostlygeek's download.sh if you want to generate and download yourself a bunch of these .csv.gz files.

Rust

First @mythmon stepped up and wrote two versions. One single-threaded and one using rayon which will use all CPUs you have.

The version using rayon looks like this (single-threaded version here):

extern crate csv;
extern crate flate2;
#[macro_use]
extern crate serde_derive;
extern crate chrono;
extern crate rayon;

use std::env;
use std::fs::File;
use std::io;
use std::iter::Sum;

use chrono::{DateTime, Utc, Duration};
use flate2::read::GzDecoder;
use rayon::prelude::*;

#[derive(Debug, Deserialize)]
struct Row {
    bucket: String,
    key: String,
    size: usize,
    last_modified_date: DateTime<Utc>,
    etag: String,
}

struct Stats {
    total: usize,
    recent: usize,
}

impl Sum for Stats {
    fn sum<I: Iterator<Item=Self>>(iter: I) -> Self {
        let mut acc = Stats { total: 0, recent: 0 };
        for stat in  iter {
            acc.total += stat.total;
            acc.recent += stat.recent;
        }
        acc
    }
}

fn main() {
    let cutoff = Utc::now() - Duration::days(180);
    let filenames: Vec<String> = env::args().skip(1).collect();

    let stats: Stats = filenames.par_iter()
        .map(|filename| count(&filename, cutoff).expect(&format!("Couldn't read {}", &filename)))
        .sum();

    let percent = 100.0 * stats.recent as f32 / stats.total as f32;
    println!("{} / {} = {:.2}%", stats.recent, stats.total, percent);
}

fn count(path: &str, cutoff: DateTime<Utc>) -> Result<Stats, io::Error> {
    let mut input_file = File::open(&path)?;
    let decoder = GzDecoder::new(&mut input_file)?;
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(decoder);

    let mut total = 0;
    let recent = reader.deserialize::<Row>()
        .flat_map(|row| row)  // Unwrap Somes, and skip Nones
        .inspect(|_| total += 1)
        .filter(|row| row.last_modified_date > cutoff)
        .count();

    Ok(Stats { total, recent })
}

I compiled it like this (I have rustc 1.26 installed):

▶ cargo build --release --bin single_threaded
▶ time ./target/release/single_threaded ../*.csv.gz

That finishes in 22s.

Now let's try the one that uses all CPUs in parallel:

▶ cargo build --release --bin rayon
▶ time ./target/release/rayon ../*.csv.gz

That took 5.6s.

That's roughly 3 times faster than the best Python version.

When chatting with my teammates about this, I "nerd-sniped" another colleague, Ted Mielczarek, who forked Mike's Rust version.

Compiling and running these two, I get 17.4s for the single-threaded version and 2.5s for the rayon one.

In conclusion

  1. Simplest Python version: 3m44s
  2. Using PyPy (for Python 3.5): 2m30s
  3. Using asyncio: 3m37s
  4. Using concurrent.futures.ThreadPoolExecutor: 7m05s
  5. Using concurrent.futures.ProcessPoolExecutor: 1m5s
  6. Using ciso8601 to parse the dates: 1m08s
  7. Using ciso8601 and concurrent.futures.ProcessPoolExecutor: 18.4s
  8. Novice Go version: 20s
  9. Go version with parallel workers: 5.3s
  10. Single-threaded Rust version: 22s
  11. Parallel workers in Rust: 5.6s
  12. (Ted's) Single-threaded Rust version: 17.4s
  13. (Ted's) Parallel workers in Rust: 2.5s

Most interesting is that this is not surprising. Of course it gets faster if you use more CPUs in parallel. And of course a C binary to do a critical piece in Python will speed things up. What I'm personally quite attracted to is how easy it was to replace the date parsing with ciso8601 in Python and get a more-than-double performance boost with very little work.

Yes, I'm perfectly aware that these are not scientific conditions and the list of disclaimers is long and boring. However, it was fun! It's fun to compare and contrast solutions like this. Don't you think?

Always return namespaces in Django REST Framework

11 May 2018 1 comment   Django, Python


By default, when you hook up a model to Django REST Framework and run a query in JSON format, what you get is a list. E.g.

For GET localhost:8000/api/mymodel/

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

This isn't great because there's no good way to include other auxiliary data points that are relevant to this query. In Elasticsearch you get something like this:

{
  "took": 106,
  "timed_out": false,
  "_shards": {},
  "hits": {
    "total": 0,
    "hits": [],
    "max_score": 1
  }
}

Another key point is that perhaps today you can't think of any immediate reason why you'd want to include some additional metadata about the query, but perhaps some day you will.

The way to solve this in Django REST Framework is to override the list function in your Viewset classes.

Before

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

After

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!
        return response

Now, to re-wrap that, the response.data is an OrderedDict which you can change. Here's one way to do it:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        response.data = {
            'hits': response.data,
        }
        return response
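
And once the top level is a dict, adding auxiliary data later is trivial. For example, a sketch of tacking on a timing key (the 'took' name is just an illustration, echoing the Elasticsearch response above):

import time

def list(self, request, *args, **kwargs):
    t0 = time.time()
    response = super().list(request, *args, **kwargs)
    response.data = {
        'hits': response.data,
        'took': time.time() - t0,
    }
    return response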

And if you want to do the same for the "detail API", where you retrieve a single model instance, you can add an override to the retrieve method:

def retrieve(self, request, *args, **kwargs):
    response = super().retrieve(request, *args, **kwargs)
    response.data = {
        'hit': response.data,
    }
    return response

That's it. Perhaps it's personal preference, but if you, like me, prefer this style, this is how you do it. I like namespacing things instead of dealing with raw lists.

"Namespaces are one honking great idea -- let's do more of those!"

From import this

Note! This works equally well when you enable pagination. Enabling pagination immediately changes the main result from a list to a dictionary. I.e. instead of...

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

you now get...

{
  "count": 3,
  "next": null,
  "previous": null,
  "items": [
    {"id": 1, "name": "Foo"},
    {"id": 2, "name": "Bar"},
    {"id": 3, "name": "Baz"}
  ]
}

So if you apply the "trick" mentioned in this blog post you end up with...:

{
  "hits": {
    "count": 3,
    "next": null,
    "previous": null,
    "items": [
      {"id": 1, "name": "Foo"},
      {"id": 2, "name": "Bar"},
      {"id": 3, "name": "Baz"}
    ]
  }
}

Fastest Python datetime parser

02 May 2018 0 comments   Python

https://github.com/closeio/ciso8601


tl;dr; It's ciso8601.

I have this Python app that I'm working on. It has a cron job that downloads a listing of every single file in an S3 bucket. AWS S3 publishes a manifest of .csv.gz files; you download the manifest and then download each of the hash-named .csv.gz files it references. Then my program reads these CSV files and is able to ignore certain rows based on them being beyond the retention period. It basically parses the ISO formatted datetime string, compares it with a cutoff datetime.datetime instance and can quickly skip the row or let it in for full processing.

At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total that's about 50 million rows of CSV. That means 50 million datetime parsings.

I admit, this cron job doesn't have to be super fast and it's OK if it takes an hour since it's just a cron job running on a server in the cloud somewhere. But I would like to know: is there a way to speed up the date parsing? Because it feels expensive to do in Python 50 million times per day.

Here's the benchmark:

import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Assertions
assert f1(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f2(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f3(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == '201709211254'


functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.clock()
        func(row[3])
        t1 = time.clock()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))

Yeah, it's a bit ugly but it works. Here's the output:

FUNCTION: f1 Used 111,475 times
    BEST   0.01300ms
    MEDIAN 0.01500ms
    MEAN   0.01685ms
    STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
    BEST   0.00100ms
    MEDIAN 0.00200ms
    MEAN   0.00197ms
    STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
    BEST   0.00300ms
    MEDIAN 0.00400ms
    MEAN   0.00409ms
    STDEV  0.00225ms

In summary (median per parse):

    f1 (datetime.strptime)      ~0.015ms
    f2 (ciso8601)               ~0.002ms
    f3 (manual string slicing)  ~0.004ms

Or, if you compare to the slowest (f1): f2 is roughly 7-8 times faster and f3 is roughly 3-4 times faster.

UPDATE

If you know with confidence that you don't want or need timezone-aware datetime instances, you can use ciso8601.parse_datetime_unaware instead.

from the README:

"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."

In my benchmark the strings I use look like this: 2017-09-21T12:54:24.000Z. I added another function to the benchmark that uses ciso8601.parse_datetime_unaware and it clocked in at the exact same time as f2.
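
For reference, the added function was as trivial as this (a sketch; parse_datetime_unaware existed in the ciso8601 version used at the time of writing):

def f4(datestr):
    return ciso8601.parse_datetime_unaware(datestr)


# ...and then include it in the benchmark run:
functions = f1, f2, f3, f4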

Best EXPLAIN ANALYZE benchmark script

19 April 2018 0 comments   PostgreSQL, Python

https://gist.github.com/peterbe/966effb3f357258ddda5aa8ac385b418


tl;dr; Use best-explain-analyze.py to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefixing it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really just want to look at the Execution time milliseconds number. Especially if you have two slightly different SQL queries to compare and contrast.

However, as you might have noticed, the Execution time number varies between runs. You might think nothing's changed, but Postgres might have warmed up some internal caches or your host might be more or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

best-explain-analyze.py

Check this out: best-explain-analyze.py

Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!

Now, when you run it, it runs that query 10 times (by default) and reports the best Execution time, its mean and its median. Example output:

▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms
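
If you're curious roughly what it does under the hood, here is a minimal sketch of the idea (the real script is in the gist; the psycopg2 usage and the output parsing here are simplified assumptions):

import statistics
import sys

import psycopg2


def explain_analyze(cursor, sql, times=10):
    # Run EXPLAIN ANALYZE repeatedly and collect the timings Postgres reports.
    execution, planning = [], []
    for _ in range(times):
        cursor.execute('EXPLAIN ANALYZE ' + sql)
        for (line,) in cursor.fetchall():
            lowered = line.strip().lower()
            if lowered.startswith('execution time:'):
                execution.append(float(line.split(':')[1].replace('ms', '').strip()))
            elif lowered.startswith('planning time:'):
                planning.append(float(line.split(':')[1].replace('ms', '').strip()))
    return execution, planning


if __name__ == '__main__':
    database, sql_file = sys.argv[1], sys.argv[2]
    connection = psycopg2.connect(dbname=database)
    with open(sql_file) as f, connection.cursor() as cursor:
        execution, planning = explain_analyze(cursor, f.read())
    for label, numbers in (('EXECUTION TIME', execution), ('PLANNING TIME', planning)):
        print(label)
        print('    BEST    {:.3f}ms'.format(min(numbers)))
        print('    MEAN    {:.3f}ms'.format(statistics.mean(numbers)))
        print('    MEDIAN  {:.3f}ms'.format(statistics.median(numbers)))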

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Efficient many-to-many field lookup in Django REST Framework

11 April 2018 0 comments   PostgreSQL, Django, Python


The basic setup

Suppose you have these models:

from django.db import models


class Category(models.Model):
    name = models.CharField(max_length=100)


class Blogpost(models.Model):
    title = models.CharField(max_length=100)
    categories = models.ManyToManyField(Category)

Suppose you hook these up to Django REST Framework and list all Blogpost items. Something like this:

# urls.py
from rest_framework import routers
from . import views


router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

What's the problem?

Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/) what will happen, on the database, is something like this:

SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";

SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...

Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogpost, for each and every one, it needs to make a query to the many-to-many table (app_blogpost_categories in this example).

That's not going to be performant. In fact, it might be dangerous on your database if the query of blogposts gets big, like requesting 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap'ish but doing 1,000 selects with JOIN is never going to be cheap. It adds up horribly.

How you solve it

The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost table and 1 query on the app_category table.

First you have to override the ViewSet.list method. Then, in there you can do exactly what you need.

Here's the framework for this change:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here
        # like filtering.
        return Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!

        return response

Next, we need to make a mapping of all Category.id -1-> Category.name. But we want to make sure we only do that for the categories that are involved in the Blogpost records that matter. You could do something like this:

category_names = {}
for category in Category.objects.all():
    category_names[category.id] = category.name

But to avoid doing a lookup of category names for those you never need, use the query set on Blogpost. I.e.

qs = self.get_queryset()
all_categories = Category.objects.filter(
    id__in=Blogpost.categories.through.objects.filter(
        blogpost__in=qs
    ).values('category_id')
)
category_names = {}
for category in all_categories:
    category_names[category.id] = category.name

Now you have a dictionary of all the Category IDs that matter.

Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant where you look them all up without any fancy joins.

Next is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings). Like this:

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
    categories_map[m2m.blogpost_id].append(
        category_names[m2m.category_id]
    )

So what we have now is a dictionary whose keys are the IDs in self.get_queryset() and each value is a list of strings. E.g. ['Category X', 'Category Z'] etc.

Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:

for each in response.data:
    each['categories'] = categories_map.get(each['id'], [])

The whole solution looks something like this:

# views.py
from collections import defaultdict

from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here
        # like filtering.
        return Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        qs = self.get_queryset()
        all_categories = Category.objects.filter(
            id__in=Blogpost.categories.through.objects.filter(
                blogpost__in=qs
            ).values('category_id')
        )
        category_names = {}
        for category in all_categories:
            category_names[category.id] = category.name

        categories_map = defaultdict(list)
        for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
            categories_map[m2m.blogpost_id].append(
                category_names[m2m.category_id]
            )

        for each in response.data:
            each['categories'] = categories_map.get(each['id'], [])

        return response

It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c) is better than O(n).

Discussion

Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.

Or, if you use pagination very conservatively, and only allow like 10 items per page then it won't be so expensive to do one query per every many-to-many field.

hashin 0.12.0 is much much faster

20 March 2018 0 comments   Python


tl;dr; The new 0.12.0 version of hashin is between 6 and 30 times faster.

Version 0.12.0 is exciting because it switches from using https://pypi.python.org/pypi/<package>/json to https://pypi.org/pypi/<package>/json, i.e. the new PyPI. I only found out last week that the JSON contains a sha256 digest for every release file, even though apparently it's been there for almost a year!

Prior to version 0.12.0, what hashin used to do is download every tarball and .whl file and run pip on it, in Python, to get the checksum hash. Now, if you use the default sha256, that checksum value is immediately available right there in the JSON, for every file per release. This is especially important for binary packages (lxml for example) where it has to download a lot.
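
To see for yourself what hashin now relies on, here's a little sketch that pulls the sha256 checksums straight out of the PyPI JSON (the keys shown are how the PyPI JSON API looks at the time of writing):

import requests

package = 'Django'
data = requests.get('https://pypi.org/pypi/{}/json'.format(package)).json()

# Every file of the latest release already carries its checksums in the JSON,
# so there is nothing to download and hash locally.
for file_info in data['urls']:
    print(file_info['filename'], file_info['digests']['sha256'])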

To test this, I cleared my temp directory of any previously downloaded Django-* and lxml-* files and used hashin 0.11.5 to fill a requirements.txt for Django and lxml:

▶ hashin --version
0.11.5
▶ time hashin Django
hashin Django  0.48s user 0.14s system 12% cpu 5.123 total
▶ time hashin lxml
hashin lxml  1.61s user 0.59s system 8% cpu 25.361 total

In other words, 5.1 seconds to get the hashes for Django and 25.4 seconds for lxml.
Now, let's do it with the new 0.12.0

▶ hashin --version
0.12.0
▶ mv requirements.txt 0.11.5-requirements.txt ; touch requirements.txt
▶ time hashin Django
hashin Django  0.34s user 0.06s system 46% cpu 0.860 total
▶ time hashin lxml
hashin lxml  0.35s user 0.06s system 44% cpu 0.909 total

So, instead of 5.1 seconds, now it only takes 0.9 seconds. And instead of 25.4 seconds, now it only takes 0.9 seconds.

Yay!

Note, the old code that downloads and runs pip is still there. It kicks in when you request a digest algorithm that is not included in the JSON. For example...:

▶ hashin --version
0.12.0
▶ time hashin --algorithm sha512 lxml
hashin --algorithm sha512 lxml  1.56s user 0.64s system 5% cpu 38.171 total

(The reason this took 38 seconds instead of 25 in the run above is because of the unpredictability of the speed of my home broadband)

World's simplest web scraper bot in Python

16 March 2018 1 comment   Python


I just needed a little script to click around a bunch of pages synchronously. It just needed to load the URLs. Not actually do much with the content. Here's what I hacked up:

import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin


session = requests.Session()
urls = []


def run(url):
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).content.decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')

If you want to do this when the user is signed in, go to the site in your browser, open the Network tab on your Web Console and copy the value of the Cookie request header.
Change that session.get(url) to something like:

html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).content.decode('utf-8')

Now it can spider bot around on your site for a little while as if you're logged in.

Dirty. Simple. Fast.

Now using minimalcss

12 March 2018 0 comments   Node, Javascript, Web development, Python


tl;dr; minimalcss is much better than mincss at extracting the minimal CSS your page needs to render. It's more accurate and has more powerful features. This site now uses minimalcss to inline the minimum CSS needed to render the page.

I started minimalcss back in August 2017 and its goal was ultimately to replace mincss.

The major difference between minimalcss and mincss isn't that one is Node and one is Python, but that minimalcss is based on a full headless browser to handle all the CSS downloading and the proper rendering of the DOM. The other major difference is that mincss was based on regular expressions to analyze the CSS whereas minimalcss is based on a proper abstract syntax tree ("AST") implemented by csso.

Because minimalcss is AST based, it can do a lot more. Smarter. For example, it's able to analyze the CSS to correctly and confidently figure out if any/which keyframe animations and font-face at-rules are actually needed.
Also, because minimalcss is based on csso, when it minifies the CSS it's able to restructure the CSS in a safe and smart way. I.e. p { color: blue; } h2 { color: blue; } becomes p,h2{color:blue}.

So, now I use minimalcss here on this blog. The pages are rendered in Django and a piece of middleware sniffs all outgoing HTML responses and depending on the right conditions it dumps the HTML as a file on disk as path/in/url/index.html. Then, that newly created file is sent to a background worker in Celery which starts post-processing it. Every index.html file is accompanied with the full absolute URL that it belongs to and that's the URL that gets sent to minimalcss which returns the absolute minimal CSS the page needs to load and lastly, a piece of Python script basically does something like this:

From...

<!-- before -->
<link rel="stylesheet" href="/file.css"/>

To...

<!-- after -->
<noscript><link rel="stylesheet" href="/file.css"/></noscript>
<style> ... /* minimal CSS selectors for rendering from /file.css */ ... </style>

There is also a new JavaScript dependency: cssrelpreload.js from the loadCSS project. So all the full (original) CSS is still downloaded and inserted into the CSSOM, but it happens much later, which ultimately means the page can be rendered and be useful much sooner than if we had to wait to download and parse all of the .css URLs.

I can go into more details if there's interest and others want to do this too. Because this site is all Python and minimalcss is all Node, the integration is done over HTTP on localhost with minimalcss-server.
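
The glue is small. Roughly, the post-processing task does something like this; a sketch only, where the minimalcss-server endpoint and response shape are assumptions, the /file.css link is the placeholder from the example below, and the real code's HTML rewriting is more careful:

import requests


def post_process(index_html_path, absolute_url):
    # Ask the local minimalcss-server (the Node process) for the minimal CSS
    # needed to render this URL. Endpoint and payload shape are assumptions.
    response = requests.post(
        'http://localhost:5000/minimize',
        json={'url': absolute_url},
    )
    response.raise_for_status()
    minimal_css = response.json()['result']['finalCss']

    with open(index_html_path) as f:
        html = f.read()

    # Wrap the original stylesheet link in <noscript> and inline the minimal CSS.
    link = '<link rel="stylesheet" href="/file.css"/>'
    html = html.replace(
        link,
        '<noscript>{}</noscript>\n<style>{}</style>'.format(link, minimal_css),
    )

    with open(index_html_path, 'w') as f:
        f.write(html)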

The results

Unfortunately, this change was mixed in with other smaller optimizations that makes the comparison unfair. (Hey! my personal blog is just a side-project after all). But I downloaded a file before and after the upgrade and compared:

▶ ls -lh *.html
-rw-r--r--  1 peterbe  wheel    19K Mar  7 13:22 after.html
-rw-r--r--  1 peterbe  wheel    96K Mar  7 13:21 before.html

If I extract out the inline style block from both pages and compare it looks like this:
https://gist.github.com/peterbe/fc2fdddd5721fb35a99dc1a50c2b5311

So, downloading the initial HTML document is now 19KB instead of the previous 96KB. And visually there's absolutely no difference.

Granted, in the after.html version, a piece of JavaScript kicks in and downloads /static/css/base.min.91f6fc577a60.css and /static/css/base-dynamic.min.e335b9bfa0b1.css from the CDN. So you have to download these too:

▶ ls -lh *.css.gz
-rw-r--r--  1 peterbe  wheel   5.0K Mar  7 10:24 base-dynamic.min.e335b9bfa0b1.css.gz
-rw-r--r--  1 peterbe  wheel    95K Mar  7 10:24 base.min.91f6fc577a60.css.gz

The reason the difference appears to be huge is that I changed a couple of other things around the same time. Sorry. For example, certain DOM nodes were rendered as HTML but kept hidden until some jQuery script unhid them. For example, the "dimmer" effect over a comment textarea after you hit the submit button. Now, I've changed the jQuery code to build up the DOM when it needs it rather than relying on it being there (hidden). This means that certain base64-embedded font-faces are no longer needed in the minimal CSS payload.

Why this approach is better

So the old approach was to run mincss on the HTML and inject that as an inline style block and throw away the original (relevant) <link rel="stylesheet" href="..."> tags.
That had the annoying drawback that there was CSS in the stylesheets that I knew was going to be needed by some XHR or JavaScript later. For example, if you post a comment some jQuery code changes the DOM and that new DOM needs these CSS selectors later. So I had to do things like this:

.project a.perm { /* no mincss */
    font-size: 0.7em;
    padding-left: 8px;
}
.project a.perm:link { /* no mincss */
    color: rgb(151,151,151);
}
.project a.perm:hover { /* no mincss */
    color: rgb(51,51,51);
}

This was to inform mincss to leave those untouched even though no DOM node uses them right now. With minimalcss this is no longer needed.

What's next?

Keep working on minimalcss and make it even better.

Also, the scripting I used to modify the HTML file is a hack and should probably be put into the minimalcss project.

Last but not least, every time I put in some effort to web performance optimize my blog pages my Google ranking goes up and I usually see an increase in Google referrals in my Google Analytics because it's pretty obvious that Google loves fast sites. So I'm optimistically waiting for that effect.