Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Python. Clear filter

Whatsdeployed rewritten in React

15 April 2019 0 comments   Javascript, ReactJS, Python, Web development


A couple of months ago my colleague Michael @mythmon Cooper wanted to add a feature to the front-end code of Whatsdeployed and learned that the whole front-end is spaghetti jQuery code. So, instead, he re-wrote it in React. My only requirements were "Use create-react-app and no redux", i.e. keep it simple.

We also took the opportunity to rewrite some of the ways that URLs are handled. It used to be that a "short link" would redirect. For example GET /s-5HY would return 302 to Location: ?org=mozilla&repo=tecken&name[]=Dev&url[]=https://symbols.dev.mozaws.net/__version__&name[]=Stage... Basically, the short link was just an alias for a redirect. Just like those services like bit.ly or g.co . Now, the short link is a permanent fixture. The short link is included in the XHR calls to the server for getting the relevant data.

All old URLs will continue to work but now the canonical URL becomes /s/5HY/mozilla-services/tecken, for example. The :org/:repo isn't really necessary because the server knows exactly what 5HY (in this example means), but it's nice for the URL bar's memory.

Another thing that changed was how it can recognize "bors commits". When you use bors, you put a bunch of commits into a GitHub Pull Request and then ask the bors bot to merge them into master. Using "bors mode" in Whatsdeployed is optional but we believe it looks a lot more user-friendly. Here is an example of mozilla/normandy with and without bors toggled on and off.

Without "bors mode"
Without "bors mode"

With "bors mode"
With "bors mode"

Thank you mythmon!

Lastly, hopefully this will make it a lot easier to contribute. Check out https://github.com/peterbe/whatsdeployed. All you need is Python 3, a PostgreSQL, and almost any version of Node that can run create-react-apps. Ping me if you find it hard to get up and running.

Django ORM optimization story on selecting the least possible

22 February 2019 16 comments   PostgreSQL, Python, Django, Web development


This an optimization story that should not surprise anyone using the Django ORM. But I thought I'd share because I have numbers now! The origin of this came from a real requirement. For a given parent model, I'd like to extract the value of the name column of all its child models, and the turn all these name strings into 1 MD5 checksum string.

Variants

The first attempted looked like this:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

The SQL used to generate this is as follows:

SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name", 
"main_song"."text", "main_song"."language", "main_song"."key_phrases", 
"main_song"."popularity", "main_song"."text_length", "main_song"."metadata", 
"main_song"."created", "main_song"."modified", 
"main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Clearly, I don't need anything but just the name column, version 2:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Now, the SQL used is:

SELECT "main_song"."id", "main_song"."name" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

But still, since I don't really need instances of model class Song I can use the .values() method which gives back a list of dictionaries. This is version 3:

names = []
for song in Song.objects.filter(artist=a).values("name"):
    names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

This time Django figures it doesn't even need the primary key value so it looks like this:

SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Last but not least; there is an even faster one. values_list(). This time it doesn't even bother to map the column name to the value in a dictionary. And since I only need 1 column's value, I can set flat=True. Version 4 looks like this:

names = []
for name in Song.objects.filter(artist=a).values_list("name", flat=True):
    names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Same SQL gets used this time as in version 3.

The benchmark

Hopefully this little benchmark script speaks for itself:

from songsearch.main.models import *

import hashlib


def f1(a):
    names = []
    for song in Song.objects.filter(artist=a):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f2(a):
    names = []
    for song in Song.objects.filter(artist=a).only("name"):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f3(a):
    names = []
    for song in Song.objects.filter(artist=a).values("name"):
        names.append(song["name"])
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f4(a):
    names = []
    for name in Song.objects.filter(artist=a).values_list("name", flat=True):
        names.append(name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


artist = Artist.objects.get(name="Bad Religion")
print(Song.objects.filter(artist=artist).count())

print(f1(artist) == f2(artist))
print(f2(artist) == f3(artist))
print(f3(artist) == f4(artist))

# Reporting
import time
import random
import statistics

functions = f1, f2, f3, f4
times = {f.__name__: [] for f in functions}

for i in range(500):
    func = random.choice(functions)
    t0 = time.time()
    func(artist)
    t1 = time.time()
    times[func.__name__].append((t1 - t0) * 1000)

for name in sorted(times):
    numbers = times[name]
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST", min(numbers))
    print("\tMEDIAN", statistics.median(numbers))
    print("\tMEAN  ", statistics.mean(numbers))
    print("\tSTDEV ", statistics.stdev(numbers))

I ran this on my PostgreSQL 11.1 on my MacBook Pro with Django 2.1.7. So the database is on localhost.

The results

276
True
True
True
FUNCTION: f1 Used 135 times
    BEST 6.309986114501953
    MEDIAN 7.531881332397461
    MEAN   7.834429211086697
    STDEV  2.03779968066591
FUNCTION: f2 Used 135 times
    BEST 3.039121627807617
    MEDIAN 3.7298202514648438
    MEAN   4.012803678159361
    STDEV  1.8498943539073027
FUNCTION: f3 Used 110 times
    BEST 0.9920597076416016
    MEDIAN 1.4405250549316406
    MEAN   1.5053835782137783
    STDEV  0.3523240470133114
FUNCTION: f4 Used 120 times
    BEST 0.9369850158691406
    MEDIAN 1.3251304626464844
    MEAN   1.4017681280771892
    STDEV  0.3391019435930447

Bar chart

Discussion

I guess the hashlib.md5("".join(names).encode("utf-8")).hexdigest() stuff is a bit "off-topic" but I checked and it's roughly 300 times faster than building up the names list.

It's clearly better to ask less of Python and PostgreSQL to get a better total time. No surprise there. What was interesting was the proportion of these differences. Memorize that and you'll be better equipped if it's worth the hassle of not using the Django ORM in the most basic form.

Also, do take note that this is only relevant in when dealing with many records. The slowest variant (f1) takes, on average, 7 milliseconds.

Summarizing the difference with percentages compared to the fastest variant:

UPDATE Feb 25 2019

James suggested, although a bit "missing the point", that it could be even faster if all the aggregation is pushed into the PostgreSQL server and then the only thing that needs to transfer from PostgreSQL to Python is the final result.

By the way, name column in this particular benchmark, when concatenated into one big string, is ~4KB. So, with variant f5 it only needs to transfer 32 bytes which will/would make a bigger difference if the network latency is higher.

Here's the whole script: https://gist.github.com/peterbe/b2b7ed95d422ab25a65639cb8412e75e

And the results:

276
True
True
True
False
False
FUNCTION: f1 Used 92 times
    BEST 5.928993225097656
    MEDIAN 7.311463356018066
    MEAN   7.594626882801885
    STDEV  2.2027017044658423
FUNCTION: f2 Used 75 times
    BEST 2.878904342651367
    MEDIAN 3.3979415893554688
    MEAN   3.4774907430013022
    STDEV  0.5120246550765524
FUNCTION: f3 Used 88 times
    BEST 0.9310245513916016
    MEDIAN 1.1944770812988281
    MEAN   1.3105544176968662
    STDEV  0.35922655625999383
FUNCTION: f4 Used 71 times
    BEST 0.7879734039306641
    MEDIAN 1.1661052703857422
    MEAN   1.2262606284987758
    STDEV  0.3561764250427344
FUNCTION: f5 Used 90 times
    BEST 0.7929801940917969
    MEDIAN 1.0334253311157227
    MEAN   1.1836051940917969
    STDEV  0.4001442703048186
FUNCTION: f6 Used 84 times
    BEST 0.80108642578125
    MEDIAN 1.1119842529296875
    MEAN   1.2281338373819988
    STDEV  0.37146893005516973

Result: f5 is takes 0.793ms and (the previous "winner") f4 takes 0.788ms.

I'm not entirely sure why f5 isn't faster but I suspect it's because the dataset is too small for it all to matter.

Compare:

songsearch=# explain analyze SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..229.33 rows=56 width=16) (actual time=0.014..0.208 rows=276 loops=1)
   Index Cond: (artist_id = 22729)
 Planning Time: 0.113 ms
 Execution Time: 0.242 ms
(4 rows)

with...

songsearch=# explain analyze SELECT md5(STRING_AGG("main_song"."name", '')) AS "names_hash" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=229.47..229.48 rows=1 width=32) (actual time=0.278..0.278 rows=1 loops=1)
   ->  Index Scan using main_song_ca949605 on main_song  (cost=0.43..229.33 rows=56 width=16) (actual time=0.019..0.204 rows=276 loops=1)
         Index Cond: (artist_id = 22729)
 Planning Time: 0.115 ms
 Execution Time: 0.315 ms
(5 rows)

I ran these two SQL statements about 100 times each and recorded their best possible execution times:

1) The plain SELECT - 0.99ms
2) The STRING_AGG - 1.06ms

So that accounts from ~0.1ms difference only! Which kinda matches the results seen above. All in all, I think the dataset is too small to demonstrate this technique. But, considering the chance that the complexity might not be linear with the performance benefit, it's still interesting.

Even though this tangent is a big off-topic, it is often a great idea to push as much work into the database as you can if applicable. Especially if it means you can transfer a lot less data eventually.

Format thousands in Python

01 February 2019 2 comments   Python


tl;dr; Use f"{number:,}" to thousands format an integer to a string.

I keep forgetting and having to look this up every time. Hopefully by blogging about it, this time it'll stick in my memory. And hopefully in yours too :)

Suppose you have a number, like 1234567890 and you want to display it, here's how you do it:

>>> number = 1234567890
>>> f"{number:,}"
'1,234,567,890'

In the past, before Python 3.6, I've been using:

>>> number = 1234567890
>>> format(number, ",")
'1,234,567,890'

All of this and more detail can be found in PEP 378 -- Format Specifier for Thousands Separator. For example, you can do this beast too:

>>> number = 1234567890
>>> f"{number:020,.2f}"
'0,001,234,567,890.00'

which demonstrates (1) how to do zero-padding (of length 20), (2) the thousands comma, (3) round to 2 significant figures. All useful weapons to be able to draw from the top of your head.

UPDATE

Also, incredibly useful is the equivalent of somestring.ljust(10):

>>> mystr = "peter"
>>> f"{mystr:10}"
'peter     '
>>> f"{mystr:>10}"
'     peter'

hashin 0.14.5 and canonical pip hashes

31 January 2019 0 comments   Python

https://github.com/peterbe/hashin/issues/105


Prior to version 0.14.5 hashin would write write down the hashes of PyPI packages in the order they appear in PyPI's JSON response. That means there's a slight chance that two distinct clients/computers/humans might actually get different output when then run hashin Django==2.1.5.

The pull request has a pretty hefty explanation as it demonstrates the fix.

Do note that if the existing order of hashes in a requirements file is not in the "right" order, hashin won't correct it unless any of the hashes are different.

Thanks @SomberNight for patiently pushing for this.

variable_cache_control - Django view decorator to set max_age in runtime

22 January 2019 0 comments   Python, Django


tl;dr; If you use the django.views.decorators.cache.cache_control decorator, consider this one instead to change the max_age depending on the request.

I had/have a Django view function that looks something like this:

@cache_control(public=True, max_age=60 * 60)
def home(request, oc=None, page=1):
    ...

But, that number 60 * 60 I really needed it to be different depending on the request parameters. For example, that oc=None, if that's not None I know the page's Cache-Control header can and should be different.

So I wrote this decorator:

from django.utils.cache import patch_cache_control


def variable_cache_control(**kwargs):
    """Same as django.views.decorators.cache.cache_control except this one will
    allow the `max_age` parameter be a callable.
    """

    def _cache_controller(viewfunc):
        @functools.wraps(viewfunc)
        def _cache_controlled(request, *args, **kw):
            response = viewfunc(request, *args, **kw)
            copied = kwargs
            if kwargs.get("max_age") and callable(kwargs["max_age"]):
                max_age = kwargs["max_age"](request, *args, **kw)
                # Can't re-use, have to create a shallow clone.
                copied = dict(kwargs, max_age=max_age)
            patch_cache_control(response, **copied)
            return response

        return _cache_controlled

    return _cache_controller

Now, I can do this instead:

def _best_max_age(req, oc=None, **kwargs):
    max_age = 60 * 60
    if oc:
        max_age *= 10
    return max_age

@variable_cache_control(public=True, max_age=_best_max_age)
def home(request, oc=None, page=1):
    ...

I hope it inspires.

Use vars() to send an argparse Namespace into a function in Python

08 January 2019 0 comments   Python

https://docs.python.org/3/library/functions.html#vars


I only just learned about this today after all these years and thought you might like it too.

The trick is to conveniently turn an argparse.Namespace into keyword arguments that you can send to a function. This is the old/wrong way I've been doing it for years:

# THE OLD WAY

def main(things, option_a, option_n):
    print(locals())  # Debugging 


import argparse

parser = argparse.ArgumentParser()
parser.add_argument("things", help="Bla bla", nargs="*")
parser.add_argument("-o", "--option-a", help="Bla bla", default="Op A")
parser.add_argument("-n", "--option-n", help="Ble ble", default="Op N")
args = parser.parse_args()

main(
    things=args.things,
    option_a=args.option_a,
    option_n=args.option_n
)

That works but the tedious thing is to have to have spell out every single argument, twice!, when sending the argparse Namespace into the function. Here's the much moar betterest way:

# THE NEW WAY

def main(things, option_a, option_n):
    print(locals())  # Debugging 


import argparse

parser = argparse.ArgumentParser()
parser.add_argument("things", help="Bla bla", nargs="*")
parser.add_argument("-o", "--option-a", help="Bla bla", default="Op A")
parser.add_argument("-n", "--option-n", help="Ble ble", default="Op N")
args = parser.parse_args()

# The only difference and the magic sauce...
main(**vars(args))  

What's neat about this is that you don't have to type up every argument defined in the parser to the get it as arguments into a function. And as a bonus, Python will name match keyword arguments to arguments so the order doesn't matter.

Caveat! This "trick" assumes that the arguments in the parser match the arguments in the function. So if the main() function takes an argument called foo_bar you have to have an argument in the parser called --foo-bar.