Peterbe.com

A blog and website by Peter Bengtsson

Always return namespaces in Django REST Framework

11 May 2018 1 comment   Django, Python


By default, when you hook up a model to Django REST Framework and run a query in JSON format, what you get is a list. E.g.

For GET localhost:8000/api/mymodel/

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

This isn't great because there's no good way to include other auxiliary data points that are relevant to this query. In Elasticsearch you get something like this:

{
  "took": 106,
  "timed_out": false,
  "_shards": {},
  "hits": {
    "total": 0,
    "hits": [],
    "max_score": 1
  }
}

Another key is that perhaps today you can't think of any immediate reason why you want to include some additonal meta data about the query, but perhaps some day you will.

The way to solve this in Django REST Framework is to override the list function in your Viewset classes.

Before

# views.py
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

After

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!
        return response

Now, to re-wrap that, the response.data is a OrderedDict which you can change. Here's one way to do it:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        response.data = {
            'hits': response.data,
        }
        return response

And if you want to do the same the "detail API" where you retrieve a single model instance, you can add an override to the retrieve method:

def retrieve(self, request, *args, **kwargs):
    response = super().retrieve(request, *args, **kwargs)
    response.data = {
        'hit': response.data,
    }
    return response

That's it. Perhaps it's personal preference but if you, like me, prefers this style, this is how you do it. I like namespacing things instead of dealing with raw lists.

"Namespaces are one honking great idea -- let's do more of those!"

From import this

Note! This works equally when you enable pagination. Enabling pagination immediately changes the main result from a list to a dictionary. I.e. Instead of...

[
  {"id": 1, "name": "Foo"},
  {"id": 2, "name": "Bar"},
  {"id": 3, "name": "Baz"}
]

you now get...

{
  "count": 3,
  "next": null,
  "previous": null,
  "items": [
    {"id": 1, "name": "Foo"},
    {"id": 2, "name": "Bar"},
    {"id": 3, "name": "Baz"}
  ]
}

So if you apply the "trick" mentioned in this blog post you end up with...:

{
  "hits": {
    "count": 3,
    "next": null,
    "previous": null,
    "items": [
      {"id": 1, "name": "Foo"},
      {"id": 2, "name": "Bar"},
      {"id": 3, "name": "Baz"}
    ]
  }
}

How I found out where a bash alias was set up

09 May 2018 0 comments   Linux


I wanted to install a command line tool called gg. But for some reason, gg was already tied to an alias. No problem, I'll just delete that alias. I looked in ~/.bash_profile and I looked in ~/.zshrc and it wasn't there!

But here's how I managed to figure out where it came from:

▶ which gg
gg: aliased to git gui citool

Then I copied the git gui citool part of that output and ran:

▶ rg --hidden 'git gui citool'
.oh-my-zsh/plugins/git/git.plugin.zsh
104:alias gg='git gui citool'
105:alias gga='git gui citool --amend'

A ha! So it was .oh-my-zsh/plugins/git/git.plugin.zsh that was the culprit. Totally forgot about the plugin. It's full of other useful aliases so I just commented out the one(s) I knew I don't need any more.

By the way rg, aka. ripgrep is probably one of the best tools I have. I use it so often that it's attached to my belt rather than in my toolbox.

Real minimal example of going from setState to MobX

04 May 2018 0 comments   ReactJS, Javascript

https://github.com/peterbe/workon/commit/c1846ce782ce7c9da16f44b10c48f0be1337ae41


This is not meant as a tutorial on MobX but hopefully it can be inspirational for people who have grokked how React's setState works but now feel they need to move the state management in their React app out of the components.

Store.js
To jump right in, here is a changeset that demonstrates how to replace setState with a MobX store:
https://github.com/peterbe/workon/commit/c1846ce782ce7c9da16f44b10c48f0be1337ae41

It's a really simple Todo list application based on create-react-app. Not much to read into at this point.

Here are some caveats to be aware if you look at the diff and wonder...

Caveat last but not least... This diff does not much other than adding more library dependencies and fancy "observable arrays" that are hard to introspect with console.log debugging.
However, the intention is to...

  1. Add react-router to the mix so opening the Todo list is just one of many possible views.
  2. Now the Store.js file can be all about data. Data retrieval, storage, manipulation, mutation etc. The other components will be more simple since their only job is to render that's in the store and send events back to the store based on user actions.
  3. Note that the store is also put into window. That means I can open the web console and type store.items[2].text = "Test change" and simply by hitting enter the app re-renders to this change.

gtop is best

02 May 2018 0 comments   Javascript, MacOSX, Linux

https://github.com/aksakalli/gtop


To me, using top inside a Linux server via SSH is all muscle-memory and it's definitely good enough. On my Macbook when working on some long-running code that is resource intensive the best tool I know of is: gtop

gtop in action
gtop in action

I like it because it has the graphs I want and need. It splits up the work of each CPU which is awesome. That's useful for understanding how well a program is able to leverage more than one CPU process.

And it's really nice to have the list of Processes there to be able to quickly compare which programs are running and how that might affect the use of the CPUs.

Instead of listing alternatives I've tried before, hopefully this Reddit discussion has good links to other alternatives

Fastest Python datetime parser

02 May 2018 0 comments   Python

https://github.com/closeio/ciso8601


tl;dr; It's ciso8601.

I have this Python app that I'm working on. It has a cron job that downloads a listing of every single file listed in an S3 bucket. AWS S3 publishes a manifest of .csv.gz files. You download the manifest and for each hashhashash.csv.gz you download those files. Then my program reads these CSV files and it is able to ignore certain rows based on them being beyond the retention period. It basically parses the ISO formatted datetime as a string, compares it with a cutoff datetime.datetime instance and is able to quickly skip or allow it in for full processing.

At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total it's about 50 millions rows of CSV. That means it's 50 million datetime parsings.

I admit, this cron job doesn't have to be super fast and it's OK if it takes an hour since it's just a cron job running on a server in the cloud somewhere. But I would like to know, is there a way to speed up the date parsing because that feels expensive to do in Python 50 million times per day.

Here's the benchmark:

import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Assertions
assert f1(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f2(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f3(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == '201709211254'


functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.clock()
        func(row[3])
        t1 = time.clock()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))

Yeah, it's a bit ugly but it works. Here's the output:

FUNCTION: f1 Used 111,475 times
    BEST   0.01300ms
    MEDIAN 0.01500ms
    MEAN   0.01685ms
    STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
    BEST   0.00100ms
    MEDIAN 0.00200ms
    MEAN   0.00197ms
    STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
    BEST   0.00300ms
    MEDIAN 0.00400ms
    MEAN   0.00409ms
    STDEV  0.00225ms

In summary:

Or, if you compare to the slowest (f1):

UPDATE

If you know with confidence that you don't want or need timezone aware datetime instances, you can use csiso8601.parse_datetime_unaware instead.

from the README:

"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."

In my benchmark the strings I use look like this 2017-09-21T12:54:24.000Z. I added another function to the benmark that uses ciso8601.parse_datetime_unaware and it clocked in at the exact same time as f2.

The impressive first-meaningful-paint improvement of using minimalcss

24 April 2018 1 comment   Javascript, Web development


tl;dr; The critical CSS solution, using minimalcss, yields a 40% improvement in First Meaningful Paint and 90% improvement in the Time to Start Render.

About a month ago I enabled minimalcss here on my personal blog to properly test it in production. In that blog post I only looked at the difference in file size. This time, I'm testing the impact of it using Webpagetest.org. I picked two blog posts (that don't have images), but both have some syntax highlighting of code CSS. One and Two. Doesn't matter what's in those two pages but they're relatively similar in shape and size.

In three Webpagetest.org experiments I compare, on 4G Chome...

  1. both optimized
  2. first one optimized only
  3. second one optimized only
  4. both not optimized

Note! In the following screenshots from Webpagetest the "Thumbnail interval" is set to 0.5 seconds.

Note2! These pages used HTTP/2 and the CSS stylesheets are loaded from a CDN.

BOTH pages optimized

Both pages optimized

Timings both optimized

They are about 0.3 seconds apart in their first meaningful paint.

This is the baseline comparison. It's not perfectly the same first-meaningful-paint but close enough. The point is what difference it makes later.

Only FIRST page optimized

First page optimized

Timings first page optimized

The optimized page is 0.7 seconds faster.

Only SECOND page optimized

Second page optimized

Timings second page optimized

The optimized page is 1.4 seconds faster.

Bonus! Both pages NOT optimized

Just to check that it all holds up. Here, both pages are compared with out the critical CSS optimization.

Both pages NOT optimized

Timings both NOT optimized

If we roughly average out the first paint on the sample where both were optimized (2.1 seconds), this time it's 2.8 seconds. So the optimization of the critical CSS with minimalcss roughly makes the first paint makes it 40% faster.

Discussion

In Webpagetest.org the "First Meaningful Paint" is just one of many ways of measuring "success". I put that last word in quotation marks because this stuff is not trivial. Just because you manage to show anything to the user doesn't necessarily mean the user is happy if the user can't do what they want to do. And if you front-load a bunch of things with every trick in the book, you might have a lot of load in the background that might affect the scrolling with yank or flashing content.

I never really know which one to live by as a measure of success but the visual comparison timeline from Webpagetest definitely is fruitful and easy to understand. There is also "First Contentful Paint" which shows a slightly bigger difference between the optimized and the not-optimized pages.

For now, I'm going to call this a success. Adding minimalcss was a mixed bag of challenges. The execution of the script, on a server, requires the right amount of tooling and safeguards. After all, it depends on real web browser running inside a web server spawned from a background asynchronous message queue task with retries, logging, and metrics.

For one thing, if you just look at the visual comparisons and focus only on the rendered title the difference is not 40% faster. It's about 100% faster. That difference is explained by the word "meaningful" in First Meaningful Paint. The rest of the content is at the mercy of the banner ad and the remote loaded web fonts. For example, if you instead compare the "Time to Start Render" average timings of the optimized pages was 1.6 seconds vs 3.1 seconds for the not optimized pages. In other words, the improvement of First Meaningul Paint was 40% and the improvement of Time to Start Render was 93%.

Lastly, remember that what minimalcss (and other critical path CSS optimization tools) does is that it copies CSS from the .css files and includes it in the HTML document. That copying means the HTML document weighs 27KB more and still it wins.

Best EXPLAIN ANALYZE benchmark script

19 April 2018 0 comments   PostgreSQL, Python

https://gist.github.com/peterbe/966effb3f357258ddda5aa8ac385b418


tl;dr; Use best-explain-analyze.py to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefix it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really want to just look at the Execution time milliseconds number. Especially if you might have to slightly different SQL queries to compare and contrast.

However, as you might have noticed, the number on the Execution time varies between runs. You might think nothing's changed but Postgres might have warmed up some internal caches or your host might be more busy or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

best-explain-analyze.py

Check this out: best-explain-analyze.py

Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!

Now, when you run it it runs that thing 10 times (by default) and reports the best Execution time, its mean and its median. Example output:

▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Efficient many-to-many field lookup in Django REST Framework

11 April 2018 0 comments   PostgreSQL, Django, Python


The basic setup

Suppose you have these models:

from django.db import models


class Category(models.Model):
    name = models.CharField(max_length=100)


class Blogpost(models.Model):
    title = models.CharField(max_length=100)
    categories = models.ManyToManyField(Category)

Suppose you hook these up Django REST Framework and list all Blogpost items. Something like this:

# urls.py
from rest_framework import routers
from . import views


router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

What's the problem?

Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/) what will happen, on the database, is something like this:

SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";

SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...

Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogpost, for each and every one, it needs to make a query to the many-to-many table (app_blogpost_categories in this example).

That's not going to be performant. In fact, it might be dangerous on your database if the query of blogposts gets big, like requesting a 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap'ish but doing 1,000 selects with JOIN is never going to be cheap. It adds up horribly.

How you solve it

The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost table and 1 query on the app_category table.

First you have to override the ViewSet.list method. Then, in there you can do exactly what you need.

Here's the framework for this change:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!

        return response

Next, we need to make a mapping of all Category.id -1-> Category.name. But we want to make sure we do only on the categories that are involved in the Blogpost records that matter. You could do something like this:

category_names = {}
for category in Category.objects.all():
    category_names[category.id] = category.name

But to avoid doing a lookup of category names for those you never need, use the query set on Blogpost. I.e.

qs = self.get_queryset()
all_categories = Category.objects.filter(
    id__in=Blogpost.categories.through.objects.filter(
        blogpost__in=qs
    ).values('category_id')
)
category_names = {}
for category in all_categories:
    category_names[category.id] = category.name

Now you have a dictionary of all the Category IDs that matter.

Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant were you look them up "simplestly" without any fancy joins.

Next, is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings). Like this:

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
    categories_map[m2m.blogpost_id].append(
        category_names[m2m.category_id]
    )

So what we have now is a dictionary whose keys are the IDs in self.get_queryset() and each value is a list of a strings. E.g. ['Category X', 'Category Z'] etc.

Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:

for each in response.data:
    each['categories'] = categories_map.get(each['id'], [])

The whole solution looks something like this:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        qs = self.get_queryset()
        all_categories = Category.objects.filter(
            id__in=Blogpost.categories.through.objects.filter(
                blogpost__in=qs
            ).values('category_id')
        )
        category_names = {}
        for category in all_categories:
            category_names[category.id] = category.name

        categories_map = defaultdict(list)
        for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
            categories_map[m2m.blogpost_id].append(
                category_names[m2m.category_id]
            )

        for each in response.data:
            each['categories'] = categories_map.get(each['id'], [])

        return response

It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c) is better than O(n).

Discussion

Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.

Or, if you use pagination very conservatively, and only allow like 10 items per page then it won't be so expensive to do one query per every many-to-many field.

hashin 0.12.0 is much much faster

20 March 2018 0 comments   Python


tl;dr; The new 0.12.0 version of hashin is between 6 and 30 times faster.

Version 0.12.0 is exciting because it switches from using https://pypi.python.org/pypi/<package>/json to https://pypi.org/pypi/<package>/json so it's using the new PyPI. I only last week found out about the JSON containing .digest.sha256 as part of the JSON even though apparently it's been there for almost a year!

Prior to version 0.12.0, what hashin used to do is download every tarball and .whl file and run pip on it, in Python, to get the checksum hash. Now, if you use the default sha256, that checksum value is immediately available right there in the JSON, for every file per release. This is especially important for binary packages (lxml for example) where it has to download a lot.

To test this, I cleared my temp directory of any previously downloaded Django-* and lxml-* files and used hashin 0.11.5 to fill a requirements.txt for Django and lxml:

▶ hashin --version
0.11.5
▶ time hashin Django
hashin Django  0.48s user 0.14s system 12% cpu 5.123 total
▶ time hashin lxml
hashin lxml  1.61s user 0.59s system 8% cpu 25.361 total

In other words, 5.1 seconds to get the hashes for Django and 25.4 seconds for lxml.
Now, let's do it with the new 0.12.0

▶ hashin --version
0.12.0
▶ mv requirements.txt 0.11.5-requirements.txt ; touch requirements.txt
▶ time hashin Django
hashin Django  0.34s user 0.06s system 46% cpu 0.860 total
▶ time hashin lxml
hashin lxml  0.35s user 0.06s system 44% cpu 0.909 total

So, instead of 5.1 seconds, now it only takes 0.9 seconds. And instead of 25.4 seconds, now it only takes 0.9 seconds.

Yay!

Note, the old code that downloads and runs pip is still there. It kicks in when you request a digest checksum that is not included in the JSON. For example...:

▶ hashin --version
0.12.0
▶ time hashin --algorithm sha512 lxml
hashin --algorithm sha512 lxml  1.56s user 0.64s system 5% cpu 38.171 total

(The reason this took 38 seconds instead of 25 in the run above is because of the unpredictability of the speed of my home broadband)

Worlds simplest web scraper bot in Python

16 March 2018 1 comment   Python


I just needed a little script to click around a bunch of pages synchronously. It just needed to load the URLs. Not actually do much with the content. Here's what I hacked up:

import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin


session = requests.Session()
urls = []


def run(url):
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')

If you want to do this when the user is signed in, go to the site in your browser, open the Network tab on your Web Console and copy the value of the Cookie request header.
Change that session.get(url) to something like:

html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).decode('utf-8')

Now it can spider bot around on your site for a little while as if you're logged in.

Dirty. Simple. Fast.