Peterbe.com

A blog and website by Peter Bengtsson

Best EXPLAIN ANALYZE benchmark script

19 April 2018 0 comments   PostgreSQL, Python

https://gist.github.com/peterbe/966effb3f357258ddda5aa8ac385b418


tl;dr; Use best-explain-analyze.py to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefix it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really want to just look at the Execution time milliseconds number. Especially if you might have to slightly different SQL queries to compare and contrast.

However, as you might have noticed, the number on the Execution time varies between runs. You might think nothing's changed but Postgres might have warmed up some internal caches or your host might be more busy or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

best-explain-analyze.py

Check this out: best-explain-analyze.py

Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!

Now, when you run it it runs that thing 10 times (by default) and reports the best Execution time, its mean and its median. Example output:

▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Efficient many-to-many field lookup in Django REST Framework

11 April 2018 0 comments   PostgreSQL, Django, Python


The basic setup

Suppose you have these models:

from django.db import models


class Category(models.Model):
    name = models.CharField(max_length=100)


class Blogpost(models.Model):
    title = models.CharField(max_length=100)
    categories = models.ManyToManyField(Category)

Suppose you hook these up Django REST Framework and list all Blogpost items. Something like this:

# urls.py
from rest_framework import routers
from . import views


router = routers.DefaultRouter()
router.register(r'blogposts', views.BlogpostViewSet)
# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

What's the problem?

Then, if you execute this list (e.g. curl http://localhost:8000/api/blogposts/) what will happen, on the database, is something like this:

SELECT "app_blogpost"."id", "app_blogpost"."title" FROM "app_blogpost";

SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1025;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 193;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 757;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 853;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1116;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1126;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 964;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 591;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1112;
SELECT "app_category"."id", "app_category"."name" FROM "app_category" INNER JOIN "app_blogpost_categories" ON ("app_category"."id" = "app_blogpost_categories"."category_id") WHERE "app_blogpost_categories"."blogpost_id" = 1034;
...

Obviously, it depends on how you define that serializers.BlogpostSerializer class, but basically, as it loops over the Blogpost, for each and every one, it needs to make a query to the many-to-many table (app_blogpost_categories in this example).

That's not going to be performant. In fact, it might be dangerous on your database if the query of blogposts gets big, like requesting a 100 or 1,000 records. Fetching 1,000 rows from the app_blogpost table might be cheap'ish but doing 1,000 selects with JOIN is never going to be cheap. It adds up horribly.

How you solve it

The trick is to only do 1 query on the many-to-many field's table, 1 query on the app_blogpost table and 1 query on the app_category table.

First you have to override the ViewSet.list method. Then, in there you can do exactly what you need.

Here's the framework for this change:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        # Where the magic happens!

        return response

Next, we need to make a mapping of all Category.id -1-> Category.name. But we want to make sure we do only on the categories that are involved in the Blogpost records that matter. You could do something like this:

category_names = {}
for category in Category.objects.all():
    category_names[category.id] = category.name

But to avoid doing a lookup of category names for those you never need, use the query set on Blogpost. I.e.

qs = self.get_queryset()
all_categories = Category.objects.filter(
    id__in=Blogpost.categories.through.objects.filter(
        blogpost__in=qs
    ).values('category_id')
)
category_names = {}
for category in all_categories:
    category_names[category.id] = category.name

Now you have a dictionary of all the Category IDs that matter.

Note! The above "optimization" assumes that it's worth it. Meaning, if the number of Category records in your database is huge, and the Blogpost queryset is very filtered, then it's worth only extracting a subset. Alternatively, if you only have like 100 different categories in your database, just do the first variant were you look them up "simplestly" without any fancy joins.

Next, is the mapping of Blogpost.id -N-> Category.name. To do that you need to build up a dictionary (int to list of strings). Like this:

categories_map = defaultdict(list)
for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
    categories_map[m2m.blogpost_id].append(
        category_names[m2m.category_id]
    )

So what we have now is a dictionary whose keys are the IDs in self.get_queryset() and each value is a list of a strings. E.g. ['Category X', 'Category Z'] etc.

Lastly, we need to put these back into the serialized response. This feels a little hackish but it works:

for each in response.data:
    each['categories'] = categories_map.get(each['id'], [])

The whole solution looks something like this:

# views.py
from rest_framework import viewsets

class BlogpostViewSet(viewsets.ModelViewSet):
    # queryset = Blogpost.objects.all().order_by('date')
    serializer_class = serializers.BlogpostSerializer

    def get_queryset(self):
        # Chances are, you're doing something more advanced here 
        # like filtering.
        Blogpost.objects.all().order_by('date')

    def list(self, request, *args, **kwargs):
        response = super().list(request, *args, **kwargs)
        qs = self.get_queryset()
        all_categories = Category.objects.filter(
            id__in=Blogpost.categories.through.objects.filter(
                blogpost__in=qs
            ).values('category_id')
        )
        category_names = {}
        for category in all_categories:
            category_names[category.id] = category.name

        categories_map = defaultdict(list)
        for m2m in Blogpost.categories.through.objects.filter(blogpost__in=qs):
            categories_map[m2m.blogpost_id].append(
                category_names[m2m.category_id]
            )

        for each in response.data:
            each['categories'] = categories_map.get(each['id'], [])

        return response

It's arguably not very pretty but doing 3 tight queries instead of doing as many queries as you have records is much better. O(c) is better than O(n).

Discussion

Perhaps the best solution is to not run into this problem. Like, don't serialize any many-to-many fields.

Or, if you use pagination very conservatively, and only allow like 10 items per page then it won't be so expensive to do one query per every many-to-many field.

hashin 0.12.0 is much much faster

20 March 2018 0 comments   Python


tl;dr; The new 0.12.0 version of hashin is between 6 and 30 times faster.

Version 0.12.0 is exciting because it switches from using https://pypi.python.org/pypi/<package>/json to https://pypi.org/pypi/<package>/json so it's using the new PyPI. I only last week found out about the JSON containing .digest.sha256 as part of the JSON even though apparently it's been there for almost a year!

Prior to version 0.12.0, what hashin used to do is download every tarball and .whl file and run pip on it, in Python, to get the checksum hash. Now, if you use the default sha256, that checksum value is immediately available right there in the JSON, for every file per release. This is especially important for binary packages (lxml for example) where it has to download a lot.

To test this, I cleared my temp directory of any previously downloaded Django-* and lxml-* files and used hashin 0.11.5 to fill a requirements.txt for Django and lxml:

▶ hashin --version
0.11.5
▶ time hashin Django
hashin Django  0.48s user 0.14s system 12% cpu 5.123 total
▶ time hashin lxml
hashin lxml  1.61s user 0.59s system 8% cpu 25.361 total

In other words, 5.1 seconds to get the hashes for Django and 25.4 seconds for lxml.
Now, let's do it with the new 0.12.0

▶ hashin --version
0.12.0
▶ mv requirements.txt 0.11.5-requirements.txt ; touch requirements.txt
▶ time hashin Django
hashin Django  0.34s user 0.06s system 46% cpu 0.860 total
▶ time hashin lxml
hashin lxml  0.35s user 0.06s system 44% cpu 0.909 total

So, instead of 5.1 seconds, now it only takes 0.9 seconds. And instead of 25.4 seconds, now it only takes 0.9 seconds.

Yay!

Note, the old code that downloads and runs pip is still there. It kicks in when you request a digest checksum that is not included in the JSON. For example...:

▶ hashin --version
0.12.0
▶ time hashin --algorithm sha512 lxml
hashin --algorithm sha512 lxml  1.56s user 0.64s system 5% cpu 38.171 total

(The reason this took 38 seconds instead of 25 in the run above is because of the unpredictability of the speed of my home broadband)

Worlds simplest web scraper bot in Python

16 March 2018 1 comment   Python


I just needed a little script to click around a bunch of pages synchronously. It just needed to load the URLs. Not actually do much with the content. Here's what I hacked up:

import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin


session = requests.Session()
urls = []


def run(url):
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')

If you want to do this when the user is signed in, go to the site in your browser, open the Network tab on your Web Console and copy the value of the Cookie request header.
Change that session.get(url) to something like:

html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).decode('utf-8')

Now it can spider bot around on your site for a little while as if you're logged in.

Dirty. Simple. Fast.

filterToQueryString - JavaScript function to turn current filter into a query string

15 March 2018 0 comments   ReactJS, Javascript, Web development


tl;dr; this function:

export const filterToQueryString = (filterObj, overrides) => {
  const copy = Object.assign(filterObj, overrides || {})
  const searchParams = new URLSearchParams()
  Object.entries(copy).forEach(([key, value]) => {
    if (Array.isArray(value) && value.length) {
      value.forEach(v => searchParams.append(key, v))
    } else if (value) {
      searchParams.set(key, value)
    }
  })
  searchParams.sort()
  return searchParams.toString()
}

I have a React project that used to use query-string to serialize and deserialize objects between React state and URL query strings. Yesterday version 6.0.0 came out and now I'm getting this error during yarn run build:

yarn run v1.5.1
$ react-scripts build
Creating an optimized production build...
Failed to compile.

Failed to minify the code from this file: 

    ./node_modules/query-string/index.js:8 

Read more here: http://bit.ly/2tRViJ9

error An unexpected error occurred: "Command failed.
Exit code: 1

Perhaps this is the wake up call to switch to URLSearchParams (documentation here). Yes it is. Let's do it.

My use case is that I store a dictionary of filters in React this.state. The filter object is updated by submitting a form that looks like this:

Fitler form

Since the form inputs might be empty strings my filter dictionary in this.state might look like this:

{
  user: '@mozilla.com', 
  created_at: 'yesterday', 
  size: '>= 1m, <300G', 
  uploaded_at: ''
}

What I want that to become is: created_at=yesterday&size=>%3D+1m%2C+<300G&user=%40mozilla.com
So it's important to be able to skip falsy values (empty strings or possibly empty arrays).

Sometimes there are other key-values that needs to be added that isn't part of what the user chose. So it needs to be easy to squeeze in additional key-values. Here's the function:

export const filterToQueryString = (filterObj, overrides) => {
  const copy = Object.assign(filterObj, overrides || {})
  const searchParams = new URLSearchParams()
  Object.entries(copy).forEach(([key, value]) => {
    if (Array.isArray(value) && value.length) {
      value.forEach(v => searchParams.append(key, v))
    } else if (value) {
      searchParams.set(key, value)
    }
  })
  searchParams.sort()
  return searchParams.toString()
}

I use it like this:

_fetchUploadsNewCountLoop = () => {
  const qs = filterToQueryString(this.state.filter, {
    created_at: '>' + this.state.latestUpload
  })
  const url = '/api/uploads?' + qs
  ...
  fetch(...)
}

Now using minimalcss

12 March 2018 0 comments   Node, Javascript, Web development, Python


tl;dr; minimalcss is much better than mincss to slew out the minimal CSS your page needs to render. More accurate and more powerful features. This site now uses minimalcss in inline the minimum CSS needed to render the page.

I started minimalcss back in August 2017 and its goal was ultimately to replace mincss.

The major difference between minimalcss and mincss isn't that one is Node and one is Python, but that minimalcss is based on a full headless browser to handle all the CSS downloading and the proper rendering of the DOM. The other major difference is that mincss was based in regular expressions to analyze the CSS and minimalcss is based on proper abstract syntax tree ("AST") implemented by csso.

Because minimalcss is AST based, it can do a lot more. Smarter. For example, it's able to analyze the CSS to correctly and confidently figure out if any/which keyframe animations and font-face at-rules are actually needed.
Also, because minimalcss is based on csso, when it minifies the CSS it's able to restructure the CSS in a safe and smart way. I.e. p { color: blue; } h2 { color: blue; } becomes p,h2{color:blue}.

So, now I use minimalcss here on this blog. The pages are rendered in Django and a piece of middleware sniffs all outgoing HTML responses and depending on the right conditions it dumps the HTML as a file on disk as path/in/url/index.html. Then, that newly created file is sent to a background worker in Celery which starts post-processing it. Every index.html file is accompanied with the full absolute URL that it belongs to and that's the URL that gets sent to minimalcss which returns the absolute minimal CSS the page needs to load and lastly, a piece of Python script basically does something like this:

From...

<!-- before -->
<link rel="stylesheet" href="/file.css"/>

To...

<!-- after -->
<noscript><link rel="stylesheet" href="/file.css"/></noscript>
<style> ... /* minimal CSS selectors for rendering from /file.css */ ... </style>

There is also a new JavaScript dependency which is the cssrelpreload.js from the loadCSS project. So all the full (original) CSS is still downloaded and inserted into the CSSOM but it happens much later which ultimately means the page can be rendered and useful much sooner than if we'd have to wait to download and parse all of the .css URLs.

I can go into more details if there's interest and others want to do this too. Because this site is all Python and minimalcss is all Node, the integration is done over HTTP on localhost with minimalcss-server.

The results

Unfortunately, this change was mixed in with other smaller optimizations that makes the comparison unfair. (Hey! my personal blog is just a side-project after all). But I downloaded a file before and after the upgrade and compared:

▶ ls -lh *.html
-rw-r--r--  1 peterbe  wheel    19K Mar  7 13:22 after.html
-rw-r--r--  1 peterbe  wheel    96K Mar  7 13:21 before.html

If I extract out the inline style block from both pages and compare it looks like this:
https://gist.github.com/peterbe/fc2fdddd5721fb35a99dc1a50c2b5311

So, downloading the initial HTML document is now 19KB instead of previous 96KB. And visually there's absolutely no difference.

Granted, in the after.html version, a piece of JavaScript kicks in and downloads /static/css/base.min.91f6fc577a60.css and /static/css/base-dynamic.min.e335b9bfa0b1.css from the CDN. So you have to download these too:

▶ ls -lh *.css.gz
-rw-r--r--  1 peterbe  wheel   5.0K Mar  7 10:24 base-dynamic.min.e335b9bfa0b1.css.gz
-rw-r--r--  1 peterbe  wheel    95K Mar  7 10:24 base.min.91f6fc577a60.css.gz

The reason the difference appears to be huge is because I changed a couple of other things around the same time. Sorry. For example, certain DOM nodes were rendered as HTML but made hidden until some jQuery script made it not hidden anymore. For example, the "dimmer" effect over a comment textarea after you hit the submit button. Now, I've changed the jQuery code to build up the DOM when it needs it rather than relying on it being there (hidden). This means that certain base64 embedded font-faces are no longer needed in the minimal CSS payload.

Why this approach is better

So the old approach was to run mincss on the HTML and inject that as an inline style block and throw away the original (relevant) <link rel="stylesheet" href="..."> tags.
That had the annoying drawback that there was CSS in the stylesheets that I knew was going to be needed by some XHR or JavaScript later. For example, if you post a comment some jQuery code changes the DOM and that new DOM needs these CSS selectors later. So I had to do things like this:

.project a.perm { /* no mincss */
    font-size: 0.7em;
    padding-left: 8px;
}
.project a.perm:link { /* no mincss */
    color: rgb(151,151,151);
}
.project a.perm:hover { /* no mincss */
    color: rgb(51,51,51);
}

This was to inform mincss to leave those untouched even though no DOM node uses them right now. With minimalcss this is no longer needed.

What's next?

Keep working on minimalcss and make it even better.

Also, the scripting I used to modify the HTML file is a hack and should probably be put into the minimalcss project.

Last but not least, every time I put in some effort to web performance optimize my blog pages my Google ranking goes up and I usually see an increase in Google referrals in my Google Analytics because it's pretty obvious that Google loves fast sites. So I'm optimistically waiting for that effect.

Docker gotcha with building a Dockerfile in sub directory

02 March 2018 0 comments   Docker


tl;dr; Watch out for .dockerignore causing no such file or directory when building Docker images

First I tried to use docker-compose:

▶ docker-compose build ui
Building ui
Step 1/8 : FROM node:9
 ---> 29831ba76d93
Step 2/8 : ADD ui/package.json /package.json
ERROR: Service 'ui' failed to build: ADD failed: stat /var/lib/docker/tmp/docker-builder079614651/ui/package.json: no such file or directory

What the heck? Did I typo the name in the ui/Dockerfile?

The docker-compose.yml directive for this was:

yaml
  ui:
    build:
      context: .
      dockerfile: ui/Dockerfile
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
      - "35729:35729"
    volumes:
      - $PWD/ui:/app
    command: start

I don't know if it's the awesomest way to do docker-compose but I did copy this exactly from a different (working!) project. That other project called the web stuff frontend instead of ui in this project.

The Dockerfile looked like this:

FROM node:9

ADD ./ui/package.json ./
ADD ./ui/package-lock.json ./

RUN npm install

EXPOSE 3000
EXPOSE 35729

CMD [ "npm", "start" ]

Let's try without docker-compose. Or rather, do with docker what docker-compose does for me automatically.

▶ docker build ui -f ui/Dockerfile
Sending build context to Docker daemon  158.2MB
Step 1/8 : FROM node:9
 ---> 29831ba76d93
Step 2/8 : ADD ui/package.json /package.json
ADD failed: stat /var/lib/docker/tmp/docker-builder001494654/ui/package.json: no such file or directory

So I thought I perhaps have misunderstood how relative paths worked. I tried EVERYTHING!

I tried changing to docker build . -f ui/Dockerfile. No luck.

I tried all sorts of combinations of this with also changing the line to ADD ui/package.json ... or ADD /ui/package.json ... or ADD ./package.json ... or ADD package.json .... Nothing worked.

Finally I got it to work by doing

▶ cd ui
▶ docker build . -f Dockerfile

but for that to work I had to remove all references of the directory name ui in the Dockerfile. Nice, but this is not going to work in docker-compose.yml since that starts outside the directory ./ui/.

Sigh!

So then I learned about contexts in docker. Well, I skimmed the docs rapidly. To keep things clean it's a good idea to do things within the directory that matters. So I made this change in the docker-compose.yml:

  ui:
    build:
      context: ui
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
      - "35729:35729"
    volumes:
      - $PWD/ui:/app
    command: start

(Note the build.context: and build.dockerfile:)

Still doesn't work! 😫 Still various variations of no such file or directory.

The solution

Turns out, in projectroot/.dockerignore it had ui/ as a line entry!!!

I believe this project used to do some of the Python stuff with Docker and the React web app was done "locally" on the host. And since the ui/node_modules directory is so huge someone decided it was smart of avoid Docker mounting that.

Now the .dockerignore has .ui/node_modules and now everything works. I can build it with plain docker and docker-compose from outside the directory.

Perhaps I should have spent the time I spent writing this blog post to instead file a structured GitHub issue on Docker itself somewhere. I.e. that it should have warned be better. Any takers?

How to throttle AND debounce an autocomplete input in React

01 March 2018 0 comments   ReactJS, Javascript, Web development


Let's start with some best practices for a good autocomplete input:

'f' - most common search term on Google

To demonstrate these best practises, I'm going to use React with a mocked-out network request and mocked out UI for actual drop-down of options that usually appears underneath the input widget.

The Most Basic Version

In this version we have an event listener on every onChange and send the value of the input to the autocomplete function (called _fetch in this example):

class App extends React.Component {
  state = { q: "" };

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearch();
    });
  };

  autocompleteSearch = () => {
    this._fetch(this.state.q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

You can try it here: No Throttle or Debounce

Note, when use it that an autocomplete lookup is done for every single change to the input (characters typed in or whole words pasted in). Typing in "Alask" at a normal speed our make an autocomplete lookup for "a", "al", "ala", "alas", and "alask".

Also worth pointing out, if you're on a CPU limited device, even if the autocomplete lookups can be done without network requests (e.g. you have a local "database" in-memory) there's still expensive DOM updates for that needs to be done for every single character/word typed in.

Throttled

What a throttle does is that it triggers predictably after a certain time. Every time. Basically, it's it prevents excessive or repeated calling of another function but doesn't get reset.

So if you type "t h r o t t l e" at a speed of 1 key press per 500ms the whole thing will take 8x500ms=3s and if you have a throttle on that, with a delay of 1s, it will fire 4 times.

I highly recommend using throttle-debounce to actually do the debounce. Let's rewrite our demo to use debounce:

import { throttle } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: "" };
    this.autocompleteSearchThrottled = throttle(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearchThrottled(this.state.q);
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Throttle</h2>
        <p>½ second Throttle triggering the autocomplete on every input.</p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

One thing to notice on the React side is that the autocompleteSearch method can no longer use this.state.q because the function gets executed by the throttle function so the this is different. That's why, in this version we pass the search term as an argument instead.

You can try it here: Throttle

If you type something reasonably fast you'll notice it fires a couple of times. It's quite possible that if you type a bunch of stuff, with your eyes on the keyboard, by the time you're done you'll see it made a bunch of (mocked) autocomplete lookups whilst you weren't paying attention. You should also notice that it fired on the very first character you typed.

A cool feature about this is that if you can afford the network lookups, the interface will feel snappy. Hopefully, if your server is fast to respond to the autocomplete lookups there are quickly some suggestions there. At least it's a great indicator that the autocomplete UX is a think the user can expect as she types more.

Debounce

An alternative approach is to use a debounce. From the documentation of throttle-debounce:

"Debouncing, unlike throttling, guarantees that a function is only executed a single time, either at the very beginning of a series of calls, or at the very end."

Basically, ever time you "pile something on" it discards all the other delayed executions. Changing to this version is easy. just change import { throttle } from "throttle-debounce"; to import { debounce } from "throttle-debounce"; and change this.autocompleteSearchThrottled = throttle(1000, this.autocompleteSearch); to this.autocompleteSearchDebounced = debounce(1000, this.autocompleteSearch);

Here is the debounce version:

import { debounce } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: "" };
    this.autocompleteSearchDebounced = debounce(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      this.autocompleteSearchDebounced(this.state.q);
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Debounce</h2>
        <p>
          ½ second Debounce triggering the autocomplete on every input.
        </p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

You can try it here: Throttle

If you try it you'll notice that if you type at a steady pace (under 1 second for each input), it won't really trigger any autocomplete lookups at all. It basically triggers when you take your hands off the keyboard. But the silver lining with this approach is that if you typed "This is my long search input" it didn't bother looking things up for "this i", "this is my l", "this is my long s", "this is my long sear", "this is my long search in" since they are probably not very useful.

Best of Both World; Throttle and Debounce

The throttle works great in the beginning when you want the autocomplete widget to seem eager but if the user starts typing in a lot, you'll want to be more patient. It's quite human. If a friend is trying to remember something you're probably at first really quick to try to help with suggestions, but once you friend starts to remember and can start reciting, you patiently wait a bit more till they have said what they're going to say.

In this version we're going to use throttle (the eager one) in the beginning when the input is short and debounce (the patient one) when user has ignored the first autocomplete inputs and starting typing something longer.

Here is the version that uses both:

import { throttle, debounce } from "throttle-debounce";

class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { q: ""};
    this.autocompleteSearchDebounced = debounce(500, this.autocompleteSearch);
    this.autocompleteSearchThrottled = throttle(500, this.autocompleteSearch);
  }

  changeQuery = event => {
    this.setState({ q: event.target.value }, () => {
      const q = this.state.q;
      if (q.length < 5) {
        this.autocompleteSearchThrottled(this.state.q);
      } else {
        this.autocompleteSearchDebounced(this.state.q);
      }
    });
  };

  autocompleteSearch = q => {
    this._fetch(q);
  };

  _fetch = q => {
    const _searches = this.state._searches || [];
    _searches.push(q);
    this.setState({ _searches });
  };

  render() {
    const _searches = this.state._searches || [];
    return (
      <div>
        <h2>Throttle and Debounce</h2>
        <p>
          ½ second Throttle when input is small and ½ second Debounce when
          the input is longer.
        </p>
        <input
          placeholder="Type something here"
          type="text"
          value={this.state.q}
          onChange={this.changeQuery}
        />
        <hr />
        {_searches.length ? (
          <button
            type="button"
            onClick={event => this.setState({ _searches: [] })}
          >
            Reset
          </button>
        ) : null}
        <ol>
          {_searches.map((s, i) => {
            return <li key={s + i}>{s}</li>;
          })}
        </ol>
      </div>
    );
  }
}

In this version I cheated a little bit. The delays are different. The throttle has a delay of 500ms and the debounce as a delay of 1000ms. That makes it feel little bit more snappy there in the beginning when you start typing but once you've typed more than 5 characters, it switches to the more patient debounce version.

You can try it here: Throttle and Debounce

With this version, if you, in a steady pace typed in "south carolina" you'd notice that it does autocomplete lookups for "s", "sout" and "south carolina".

Avoiding wrongly ordered async responses

Suppose the user slowly types in "p" then "pe" then "pet", it would trigger 3 XHR requests. I.e. something like this:

fetch('/autocomplete?q=p')

fetch('/autocomplete?q=pe')

fetch('/autocomplete?q=pet')

But because all of these are asynchronous and sometimes there's unpredictable slowdowns on the network, it's not guarantee that they'll all come back in the same exact order. The solution to this is to use a "global variable" of the latest search term and then compare that to the locally scoped search term in each fetch callback promise. That might sound harder than it is. The solution basically looks like this:

class App extends React.Component {

  makeAutocompleteLookup = q => {
    // Store the latest input here scoped in the App instance.
    this.waitingFor = q;
    fetch('/autocompletelookup?q=' + q)
    .then(response => {
      if (response.status === 200) {
        // Only bother with this XHR response
        // if this query term matches what we're waiting for.
        if (q === this.waitingFor) {
          response.json()
          .then(results => {
              this.setState({results: results});
          })
        }
      }
    })
  }
}

Bonus feature; Caching

For caching the XHR requests, to avoid unnecessary network requests if the user uses backspace, the simplest solution is to maintain a dictionary of previous results as a component level instance. Let's assume you do the XHR autocomplete lookup like this initially:

class App extends React.Component {

  makeAutocompleteLookup = q => {
    const url = '/autocompletelookup?q=' + q;
    fetch(url)
    .then(response => {
      if (response.status === 200) {
        response.json()
        .then(results => {
            this.setState({ results });
        })
      }
    })
  }

}

To add caching (also a form of memoization) you can simply do this:

class App extends React.Component {

  _autocompleteCache = {};

  makeAutocompleteLookup = q => {
    const url = '/autocompletelookup?q=' + q;

    const cached = this._autocompleteCache[url];
    if (cached) {
      return Promise.resolve(cached).then(results => {
        this.setState({ results });
        });
      });
    }

    fetch(url)
    .then(response => {
      if (response.status === 200) {
        response.json()
        .then(results => {
            this.setState({ results });
        })
      }
    })
  }

}

In a more real app you might want to make that whole method always return a promise. And you might want to do something slightly smarter when response.status !== 200.

Bonus feature; Watch out for spaces

So the general gist of these above versions is that you debounce the XHR autocomplete lookups to only trigger sometimes. For short strings we trigger every, say, 300ms. When the input is longer, we only trigger when it appears the user has stopped typing. A more "advanced" approach is to trigger after a space. If I type "south carolina is a state" it's hard for a computer to know if "is", "a", or "state" is a complete word. Humans know and some English words can easily be recognized as stop words. However, what you can do is take advantage of the fact that a space almost always means the previous word was complete. It would be nice to trigger an autocomplete lookup after "south carolina" and "south carolina is" and "south carolina is a". These are also easier to deal with on the server side because, depending on your back-end, it's easier to search your database if you don't include "broken" words like "south carolina is a sta". To do that, here's one such implementation:

class App extends React.Component {

  // Just overriding the changeQuery method in this example.

  changeQuery = event => {
    const q = event.target.value
    this.setState({ q }, () => {

      // If the query term is short or ends with a
      // space, trigger the more impatient version.
      if (q.length < 5 || q.endsWith(' ')) {
        this.autocompleteSearchThrottled(q);
      } else {
        this.autocompleteSearchDebounced(q);
      }
    });
  };

  // Just overriding the changeQuery method in this example.

}

You can try it here: Throttle and Debounce with throttle on ending spaces.

Next level stuff

There is so much more that you can do for that ideal user experience. A lot depends on the context.

For example, when the input is small instead of doing a search on titles or names or whatever, you instead return a list of possible full search terms. So, if I have typed "sou" the back-end could return things like:

{
  "matches": [
     {"term": "South Carolina", "count": 123},
     {"term": "Southern", "count": 469},
     {"term": "South Dakota", "count": 98},
  ]
}

If the user selects one of these autocomplete suggestions, instead of triggering a full search you just append the selected match back into the search input widget. This is what Google does.

And if the input is longer you go ahead and actually search for the full documents. So if the input was "south caro" you return something like this:

{
  "matches": [
     {
       "title": "South Carolina Is A State", 
       "url": "/permapage/x19v093d"
     },
     {
       "title": "Best of South Carolina Parks", 
       "url": "/permapage/9vqif3z"
     },
     {
       "title": "I Live In South Carolina", 
       "url": "/permapage/abc300a1y"
     },
  ]
}

And when the XHR completes you look at what the user clicked and do something like this:

  return (<ul className="autocomplete">
    {this.state.results.map(result => {
      return <li onClick={event => {
        if (result.url) {
          document.location.href = result.url;
        } else {
          this.setState({ q: result.term });
        }
      }}>
        {result.url ? (
          <p className="document">{result.title}</p>
        ) : (
            <p className="new-term">{result.term}</p>
          )}
      </li>
    })
    }
    </ul>
  )

This is an incomplete example and more pseudo-code than a real solution but the pattern is quite nice. You're either helping the user type the full search term or if it's already a good match you can go skip the actual searching and go to the result directly.

This is how SongSearch works for example:

Suggestions for full search terms
Suggestions for full search terms

Suggestions for actual documents
Suggestions for actual documents

csso and django-pipeline

28 February 2018 0 comments   Javascript, Django, Python


This is a quick-and-dirty how-to on how to use csso to handle the minification/compression of CSS in django-pipeline.

First create a file called compressors.py somewhere in your project. Make it something like this:

import subprocess
from pipeline.compressors import CompressorBase
from django.conf import settings


class CSSOCompressor(CompressorBase):

    def compress_css(self, css):
        proc = subprocess.Popen(
            [
                settings.PIPELINE['CSSO_BINARY'], 
                '--restructure-off'
            ],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        css_out = proc.communicate(
            input=css.encode('utf-8')
        )[0].decode('utf-8')
        # was_size = len(css)
        # new_size = len(css_out)
        # print('FROM {} to {} Saved {}  ({!r})'.format(
        #     was_size,
        #     new_size,
        #     was_size - new_size,
        #     css_out[:50]
        # ))
        return css_out

In your settings.py where you configure django-pipeline make it something like this:

PIPELINE = {
    'STYLESHEETS': PIPELINE_CSS,
    'JAVASCRIPT': PIPELINE_JS,

    # These two important lines. 
    'CSSO_BINARY': path('node_modules/.bin/csso'),
    # Adjust the dotted path name to where you put your compressors.py
    'CSS_COMPRESSOR': 'peterbecom.compressors.CSSOCompressor',

    'JS_COMPRESSOR': ...

Next, install csso-cli in your project root (where you have the package.json). It's a bit confusing. The main package is called csso but to have a command line app you need to install csso-cli and when that's been installed you'll have a command line app called csso.

$ yarn add csso-cli

or

$ npm i --save csso-cli

Check that it installed:

$ ./node_modules/.bin/csso --version
3.5.0

And that's it!

--restructure-off

So csso has an advanced feature to restructure the CSS and not just remove whitespace and not needed semicolons. It costs a bit of time to do that so if you want to squeeze the extra milliseconds out, enable it. Trading time for space.
See this benchmark for a comparison with and without --restructure-off in csso.

Why csso you might ask

Check out the latest result from css-minification-benchmark. It's not super easy to read by it seems the best performing one in terms of space (bytes) is crass written by my friend and former colleague @mattbasta. However, by far the fastest is csso when using --restructre-off. Minifiying font-awesome.css with crass takes 326.52 ms versus 3.84 ms in csso.

But what's great about csso is Roman @lahmatiy Dvornov. I call him a friend too for all the help and work he's done on minimalcss (not a CSS minification tool by the way). Roman really understands CSS and csso is actively maintained by him and other smart people who actually get into the scary weeds of CSS browser hacks. That gives me more confidence to recommend csso. Also, squeezing a couple bytes extra out of your .min.css files isn't important when gzip comes into play. It's better that the minification tool is solid and stable.

Check out Roman's slides which, even if you don't read it all, goes to show that CSS minification is so much more than just regex replacing whitespace.
Also crass admits as one of its disadvantages: "Certain "CSS hacks" that use invalid syntax are unsupported".

Items function in JavaScript for looping over dictionaries like Python

23 February 2018 0 comments   ReactJS, Javascript


Too many times I've written code like this:

class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {Object.keys(this.props.someDictionary).map(key => {
        return <li key={key}><b>{key}:</b> {this.props.someDictionary[key]}</li> 
      })}
    </ul>
  }
}

The clunky thing about this is that you have to reference the dictionary twice. Makes it harder to refactor. In Python, you do this instead:

for key, value in some_dictionary.items():
    print(f'$key: $value')

To do the same in JavaScript make a function like this:

function items(dict, fn) {
  return Object.keys(dict).map((key, i) => {
    return fn(key, dict[key], i)
  })
}

Now you can use it "more like Python":

class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {items(this.props.someDictionary, (key, value) => {
        return <li key={key}><b>{key}:</b> {value}</li> 
      })}
    </ul>
  }
}

Example on CodeSandbox here

UPDATE

Thanks to @Osmose and @saltycrane for alerting me to Object.entries().

class MyComponent extends React.PureComponent {
  render() {
    return <ul>
      {Object.entries(this.props.someDictionary).map(([key, value]) => {
        return <li key={key}><b>{key}:</b> {value}</li> 
      })}
    </ul>
  }
}

Updated CodeSandbox here