Peterbe.com

Peter Bengtsson's blog

Filtered by Python

Page 2

How to sort case insensitively with empty strings last in Django

April 3, 2022
1 comment Django, Python, PostgreSQL

Imagine you have something like this in Django:


class MyModel(models.Models):
    last_name = models.CharField(max_length=255, blank=True)
    ...

The most basic sorting is either: queryset.order_by('last_name') or queryset.order_by('-last_name'). But what if you want entries with a blank string last? And, you want it to be case insensitive. Here's how you do it:


from django.db.models.functions import Lower, NullIf
from django.db.models import Value


if reverse:
    order_by = Lower("last_name").desc()
else:
    order_by = Lower(NullIf("last_name", Value("")), nulls_last=True)


ALL = list(queryset.values_list("last_name", flat=True))
print("FIRST 5:", ALL[:5])
# Will print either...
#   FIRST 5: ['Zuniga', 'Zukauskas', 'Zuccala', 'Zoller', 'ZM']
# or 
#   FIRST 5: ['A', 'aaa', 'Abrams', 'Abro', 'Absher']
print("LAST 5:", ALL[-5:])
# Will print...
#   LAST 5: ['', '', '', '', '']

This is only tested with PostgreSQL but it works nicely.
If you're curious about what the SQL becomes, it's:


SELECT "main_contact"."last_name" FROM "main_contact" 
ORDER BY LOWER(NULLIF("main_contact"."last_name", '')) ASC


SELECT "main_contact"."last_name" FROM "main_contact" 
ORDER BY LOWER("main_contact"."last_name") DESC

Note that if your table columns is either a string, an empty string, or null, the reverse needs to be: Lower("last_name", nulls_last=True).desc().

How to close a HTTP GET request in Python before the end

March 30, 2022
0 comments Python

Does you server barf if your clients close the connection before it's fully downloaded? Well, there's an easy way to find out. You can use this Python script:


import sys
import requests

url = sys.argv[1]
assert '://' in url, url
r = requests.get(url, stream=True)
if r.encoding is None:
    r.encoding = 'utf-8'
for chunk in r.iter_content(1024, decode_unicode=True):
    break

I use the xh CLI tool a lot. It's like curl but better in some things. By default, if you use --headers it will make a regular GET request but close the connection as soon as it has gotten all the headers. E.g.

▶ xh --headers https://www.peterbe.com
HTTP/2.0 200 OK
cache-control: public,max-age=3600
content-type: text/html; charset=utf-8
date: Wed, 30 Mar 2022 12:37:09 GMT
etag: "3f336-Rohm58s5+atf5Qvr04kmrx44iFs"
server: keycdn-engine
strict-transport-security: max-age=63072000; includeSubdomains; preload
vary: Accept-Encoding
x-cache: HIT
x-content-type-options: nosniff
x-edge-location: usat
x-frame-options: SAMEORIGIN
x-middleware-cache: hit
x-powered-by: Express
x-shield: active
x-xss-protection: 1; mode=block

That's not be confused with doing HEAD like curl -I ....

So either with xh or the Python script above, you can get that same effect. It's a useful trick when you want to make sure your (async) server doesn't attempt to do weird stuff with the "Response" object after the connection has closed.

How to string pad a string in Python with a variable

October 19, 2021
1 comment Python

I just have to write this down because that's the rule; if I find myself googling something basic like this more than once, it's worth blogging about.

Suppose you have a string and you want to pad with empty spaces. You have 2 options:


>>> s = "peter"
>>> s.ljust(10)
'peter     '
>>> f"{s:<10}"
'peter     '

The f-string notation is often more convenient because it can be combined with other formatting directives.
But, suppose the number 10 isn't hardcoded like that. Suppose it's a variable:


>>> s = "peter"
>>> width = 11
>>> s.ljust(width)
'peter      '
>>> f"{s:<width}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid format specifier

Well, the way you need to do it with f-string formatting, when it's a variable like that is this syntax:


>>> f"{s:<{width}}"
'peter      '

TypeScript function keyword arguments like Python

September 8, 2021
0 comments Python, JavaScript

To do this in Python:


def print_person(name="peter", dob=1979):
    print(f"name={name}\tdob={dob}")


print_person() 
# prints: name=peter dob=1979

print_person(name="Tucker")
# prints: name=Tucker    dob=1979

print_person(dob=2013)
# prints: name=peter dob=2013

print_person(sex="boy")
# TypeError: print_person() got an unexpected keyword argument 'sex'

...in TypeScript:


function printPerson({
  name = "peter",
  dob = 1979
}: { name?: string; dob?: number } = {}) {
  console.log(`name=${name}\tdob=${dob}`);
}

printPerson();
// prints: name=peter    dob=1979

printPerson({});
// prints: name=peter    dob=1979

printPerson({ name: "Tucker" });
// prints: name=Tucker   dob=1979

printPerson({ dob: 2013 });
// prints: name=peter    dob=2013


printPerson({ gender: "boy" })
// Error: Object literal may only specify known properties, and 'gender'

Here's a Playground copy of it.

It's not a perfect "transpose" across the two languages but it's sufficiently similar.
The trick is that last = {} at the end of the function signature in TypeScript which makes it possible to omit keys in the passed-in object.

By the way, the pure JavaScript version of this is:


function printPerson({ name = "peter", dob = 1979 } = {}) {
  console.log(`name=${name}\tdob=${dob}`);
}

But, unlike Python and TypeScript, you get no warnings or errors if you'd do printPerson({ gender: "boy" }); with the JavaScript version.

How to install Python Poetry in GitHub Actions in MUCH faster way

July 27, 2021
0 comments Python

We use Poetry in a GitHub project. There's a pyproject.toml file (and a poetry.lock file) which with the help of the executable poetry gets you a very reliable Python environment. The only problem is that adding the poetry executable is slow. Like 10+ seconds slow. It might seem silly but in the project I'm working on, that 10+s delay is the slowest part of a GitHub Action workflow which needs to be fast because it's trying to post a comment on a pull request as soon as it possibly can.

First I tried caching $(pip cache dir) so that the underlying python -v pip install virtualenv -t $tmp_dir that install-poetry.py does would get a boost from avoiding network. The difference was negligible. I also didn't want to get too weird by overriding how the install-poetry.py works or even make my own hacky copy. I like being able to just rely on the snok/install-poetry GitHub Action to do its thing (and its future thing).

The solution was to cache the whole $HOME/.local directory. It's as simple as this:


- name: Load cached $HOME/.local
  uses: actions/cache@v2.1.6
  with:
    path: ~/.local
    key: dotlocal-${{ runner.os }}-${{ hashFiles('.github/workflows/pr-deployer.yml') }}

The key is important. If you do copy-n-paste this block of YAML to speed up your GitHub Action, please remember to replace .github/workflows/pr-deployer.yml with the name of your .yml file that uses this. It's important because otherwise, the cache might be overzealously hot when you make a change like:


       - name: Install Python poetry
-        uses: snok/install-poetry@v1.1.6
+        uses: snok/install-poetry@v1.1.7
         with:

...for example.

Now, thankfully install-poetry.py (which is the recommended way to install poetry by the way) can notice that it's already been created and so it can omit a bunch of work. The result of this is as follows:

From 10+ seconds to 2 seconds. And what's neat is that the optimization is very "unintrusive" because it doesn't mess with how the snok/install-poetry workflow works.

But wait, there's more!

If you dig up our code where we use poetry you might find that it does a bunch of other caching too. In particular, it caches .venv it creates too. That's relevant but ultimately unrelated. It basically caches the generated virtualenv from the poetry install command. It works like this:


- name: Load cached venv
  id: cached-poetry-dependencies
  uses: actions/cache@v2.1.6
  with:
    path: deployer/.venv
    key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}-${{ hashFiles('.github/workflows/pr-deployer.yml') }}

...

- name: Install deployer
  run: |
    cd deployer
    poetry install
  if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'

In this example, deployer is just the name of the directory, in the repository root, where we have all the Python code and the pyproject.toml etc. If you have yours at the root of the project you can just do: run: poetry install and in the caching step change it to: path: .venv.

Now, you get a really powerful complete caching strategy. When the caches are hot (i.e. no changes to the .yml, poetry.lock, or pyproject.toml files) you get the executable (so you can do poetry run ...) and all its dependencies in roughly 2 seconds. That'll be hard to beat!

An effective and immutable way to turn two Python lists into one

June 23, 2021
7 comments Python

tl;dr; To make 2 lists into 1 without mutating them use list1 + list2.

I'm blogging about this because today I accidentally complicated my own code. From now on, let's just focus on the right way.

Suppose you have something like this:


winners = [123, 503, 1001]
losers = [45, 812, 332]

combined = winners + losers

that will create a brand new list. To prove that it's immutable:

>>> combined.insert(0, 100)
>>> combined
[100, 123, 503, 1001, 45, 812, 332]
>>> winners
[123, 503, 1001]
>>> losers
[45, 812, 332]

What I originally did was:


winners = [123, 503, 1001]
losers = [45, 812, 332]

combined = [*winners, *losers]

This works the same and that syntax feels very JavaScript'y. E.g.

> var winners = [123, 503, 1001]
[ 123, 503, 1001 ]
> var losers = [45, 812, 332]
[ 45, 812, 332 ]
> var combined = [...winners, ...losers]
[ 123, 503, 1001, 45, 812, 332 ]
> combined.pop()
332
> losers
[ 45, 812, 332 ]

By the way, if you want to filter out duplicates, do this:


>>> a = [1, 2, 3]
>>> b = [2, 3, 4]
>>> list(dict.fromkeys(a + b))
[1, 2, 3, 4]

It's the most performant way to do it if the order is important.

And if you don't care about the order you can use this:

>>> a = [1, 2, 3]
>>> b = [2, 3, 4]
>>> list(set(a + b))
[1, 2, 3, 4]
>>> list(set(b + a))
[1, 2, 3, 4]

The correct way to index data into Elasticsearch with (Python) elasticsearch-dsl

May 14, 2021
0 comments Python, MDN, Elasticsearch

This is how MDN Web Docs uses Elasticsearch. Daily, we build all the content and then upload it all using elasticsearch-dsl using aliases. Because there are no good complete guides to do this, I thought I'd write it down for the next person who needs to do something similar. Let's jump straight into the code. The reader will need a healthy dose of imagination to fill in their details.

Indexing


# models.py

from datetime.datetime import utcnow

from elasticsearch_dsl import Document

PREFIX = "myprefix"


class MyDocument(Document):
    title = Text()
    body = Text()
    # ...

    class Index:
        name = (
            f'{PREFIX}_{utcnow().strftime("%Y%m%d%H%M%S")}'
        )

What's important to note here is that the MyDocument.Index.name is dynamically allocated every single time the module is imported. It's not very important exactly what it is called but it's important that it becomes unique each time.
This means that when you start using MyDocument it will automatically figure out which index to use. Now, it's time to create the index and bulk publish it.


# index.py
# Note! This example code skips over things like progress bars
# and verbose logging and misc sanity checks and stuff.

from elasticsearch.helpers import parallel_bulk
from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections

from .models import MyDocument, PREFIX


def index(buildroot: Path, url: str, update=False):
    """
    * 'buildroot' is where the files are we're going to read and index
    * 'url' is the host URL for the Elasticsearch server
    * 'update' is if just want to "cake on" a couple of documents 
      instead of starting over and doing a complete indexing.
    """

    # Connect and stuff
    connections.create_connection(hosts=[url], retry_on_timeout=True)
    connection = connections.get_connection()
    health = connection.cluster.health()
    status = health["status"]
    if status not in ("green", "yellow"):
        raise Exception(f"status {status} not green or yellow")

    if update:
        for name in connection.indices.get_alias():
            if name.startswith(f"{PREFIX}_"):
                document_index = Index(name)
                break
        else:
            raise IndexAliasError(
                f"Unable to find an index called {PREFIX}_*"
            )

    else:
        # Confusingly, `._index` is actually not a private API.
        # It's the documented way you're supposed to reach it.
        document_index = MyDocument._index
        document_index.create()

    def generator():
        for doc in Path(buildroot):
            # The reason for specifying the exact index name is that we might
            # be doing an update and if you don't specify it, elasticsearch_dsl
            # will fall back to using whatever Document._meta.Index automatically
            # becomes in this moment.
            yield to_search(doc, _index=document_index._name).to_dict(True)

    for success, info in parallel_bulk(connection, generator()):
        # 'success' is a boolean
        # 'info' has stuff like:
        #  - info["index"]["error"]
        #  - info["index"]["_shards"]["successful"]
        #  - info["index"]["_shards"]["failed"]
        pass

    if update:
        # When you do an update, Elasticsearch will internally delete the
        # previous docs (based on the _id primary key we set).
        # Normally, Elasticsearch will do this when you restart the cluster
        # but that's not something we usually do.
        # See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
        document_index.forcemerge()
    else:
        # Now we're going to bundle the change to set the alias to point
        # to the new index and delete all old indexes.
        # The reason for doing this together in one update is to make it atomic.
        alias_updates = [
            {"add": {"index": document_index._name, "alias": PREFIX}}
        ]
        for index_name in connection.indices.get_alias():
            if index_name.startswith(f"{PREFIX}_"):
                if index_name != document_index._name:
                    alias_updates.append({"remove_index": {"index": index_name}})
        connection.indices.update_aliases({"actions": alias_updates})

    print("All done!")



def to_search(file: Path, _index=None):
    with open(file) as f:
        data = json.load(f)
    return MyDocument(
        _index=_index,
        _id=data["identifier"],
        title=data["title"],
        body=data["body"]
    )

A lot is left to the reader as an exercise to fill in but these are the most important operations. It demonstrates how you can

Correctly create indexes
Atomically create an alias and clean up old indexes (and aliases)
How you can add to an existing index

After you've run this you'll see something like this:

$ curl http://localhost:9200/_cat/indices?v
...
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   myprefix_20210514141421 vulVt5EKRW2MNV47j403Mw   1   1      11629            0     28.7mb         28.7mb

$ curl http://localhost:9200/_cat/aliases?v
...
alias    index                   filter routing.index routing.search is_write_index
myprefix myprefix_20210514141421 -      -             -              -

Searching

When it comes to using the index, well, it depends on where your code for that is. For example, on MDN Web Docs, the code that searches the index is in an entirely different code-base. It's incidentally Python (and elasticsearch-dsl) in both places but other than that they have nothing in common. So for the searching, you need to manually make sure you write down the name of the index (or name of the alias if you prefer) into the code that searches. For example:


from elasticsearch_dsl import Search

def search(params):
    search_query = Search(index=settings.SEARCH_INDEX_NAME)

    # Do stuff to 'search_query' based on 'params'

    response = search_query.execute()   
    for hit in response:
        # ...

If you're within the same code that has that models.MyDocument in the first example code above, you can simply do things like this:


from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections

from .models import PREFIX


def analyze(
    url: str,
    text: str,
    analyzer: str,
):
    connections.create_connection(hosts=[url])
    index = Index(PREFIX)
    analysis = index.analyze(body={"text": text, "analyzer": analyzer})
    # ...

Umlauts (non-ascii characters) with git on macOS

March 22, 2021
0 comments Python, macOS

I edit a file called files/en-us/glossary/bézier_curve/index.html and then type git status and I get this:

▶ git status
...
Changes not staged for commit:
  ...
    modified:   "files/en-us/glossary/b\303\251zier_curve/index.html"

...

What's that?! First of all, I actually had this wrapped in a Python script that uses GitPython to analyze the output of for change in repo.index.diff(None):. So I got...

FileNotFoundError: [Errno 2] No such file or directory: '"files/en-us/glossary/b\\303\\251zier_curve/index.html"'

What's that?!

At first, I thought it was something wrong with how I use GitPython and thought I could force some sort of conversion to UTF-8 with Python. That, and to strip the quotation parts with something like path = path[1:-1] if path.startwith('"') else path

After much googling and experimentation, what totally solved all my problems was to run:

▶ git config --global core.quotePath false

Now you get...:

▶ git status
...
Changes not staged for commit:
  ...
    modified:   files/en-us/glossary/bézier_curve/index.html

...

And that also means it works perfectly fine with any GitPython code that does something with the repo.index.diff(None) or repo.index.diff(repo.head.commit).

Also, we I use the git-diff-action GitHub Action which would fail to spot files that contained umlauts but now I run this:


    steps:
       - uses: actions/checkout@v2
+
+      - name: Config git core.quotePath
+        run: git config --global core.quotePath false
+
       - uses: technote-space/get-diff-action@v4.0.6
         id: git_diff_content
         with:

How MDN's site-search works

February 26, 2021
3 comments Web development, Django, Python, MDN, Elasticsearch

tl;dr; Periodically, the whole of MDN is built, by our Node code, in a GitHub Action. A Python script bulk-publishes this to Elasticsearch. Our Django server queries the same Elasticsearch via /api/v1/search. The site-search page is a static single-page app that sends XHR requests to the /api/v1/search endpoint. Search results' sort-order is determined by match and "popularity".

Jamstack'ing

The challenge with "Jamstack" websites is with data that is too vast and dynamic that it doesn't make sense to build statically. Search is one of those. For the record, as of Feb 2021, MDN consists of 11,619 documents (aka. articles) in English. Roughly another 40,000 translated documents. In English alone, there are 5.3 million words. So to build a good search experience we need to, as a static site build side-effect, index all of this in a full-text search database. And Elasticsearch is one such database and it's good. In particular, Elasticsearch is something MDN is already quite familiar with because it's what was used from within the Django app when MDN was a wiki.

Note: MDN gets about 20k site-searches per day from within the site.

Build

When we build the whole site, it's a script that basically loops over all the raw content, applies macros and fixes, dumps one index.html (via React server-side rendering) and one index.json. The index.json contains all the fully rendered text (as HTML!) in blocks of "prose". It looks something like this:


{
  "doc": {
    "title": "DOCUMENT TITLE",
    "summary": "DOCUMENT SUMMARY",
    "body": [
      {
        "type": "prose", 
        "value": {
          "id": "introduction", 
          "title": "INTRODUCTION",
          "content": "<p>FIRST BLOCK OF TEXTS</p>"
       }
     },
     ...
   ],
   "popularity": 0.12345,
   ...
}

You can see one here: /en-US/docs/Web/index.json

Indexing

Next, after all the index.json files have been produced, a Python script takes over and it traverses all the index.json files and based on that structure it figures out the, title, summary, and the whole body (as HTML).

Next up, before sending this into the bulk-publisher in Elasticsearch it strips the HTML. It's a bit more than just turning <p>Some <em>cool</em> text.</p> to Some cool text. because it also cleans up things like <div class="hidden"> and certain <div class="notecard warning"> blocks.

One thing worth noting is that this whole thing runs roughly every 24 hours and then it builds everything. But what if, between two runs, a certain page has been removed (or moved), how do you remove what was previously added to Elasticsearch? The solution is simple: it deletes and re-creates the index from scratch every day. The whole bulk-publish takes a while so right after the index has been deleted, the searches won't be that great. Someone could be unlucky in that they're searching MDN a couple of seconds after the index was deleted and now waiting for it to build up again.
It's an unfortunate reality but it's a risk worth taking for the sake of simplicity. Also, most people are searching for things in English and specifically the Web/ tree so the bulk-publishing is done in a way the most popular content is bulk-published first and the rest was done after. Here's what the build output logs:

Found 50,461 (potential) documents to index
Deleting any possible existing index and creating a new one called mdn_docs
Took 3m 35s to index 50,362 documents. Approximately 234.1 docs/second
Counts per priority prefixes:
    en-us/docs/web                 9,056
    *rest*                         41,306

So, yes, for 3m 35s there's stuff missing from the index and some unlucky few will get fewer search results than they should. But we can optimize this in the future.

Searching

The way you connect to Elasticsearch is simply by a URL it looks something like this:

https://USER:PASSWD@HASH.us-west-2.aws.found.io:9243

It's an Elasticsearch cluster managed by Elastic running inside AWS. Our job is to make sure that we put the exact same URL in our GitHub Action ("the writer") as we put it into our Django server ("the reader").
In fact, we have 3 Elastic clusters: Prod, Stage, Dev.
And we have 2 Django servers: Prod, Stage.
So we just need to carefully make sure the secrets are set correctly to match the right environment.

Now, in the Django server, we just need to convert a request like GET /api/v1/search?q=foo&locale=fr (for example) to a query to send to Elasticsearch. We have a simple Django view function that validates the query string parameters, does some rate-limiting, creates a query (using elasticsearch-dsl) and packages the Elasticsearch results back to JSON.

How we make that query is important. In here lies the most important feature of the search; how it sorts results.

In one simple explanation, the sort order is a combination of popularity and "matchness". The assumption is that most people want the popular content. I.e. they search for foreach and mean to go to /en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach not /en-US/docs/Web/API/NodeList/forEach both of which contains forEach in the title. The "popularity" is based on Google Analytics pageviews which we download periodically, normalize into a floating-point number between 1 and 0. At the of writing the scoring function does something like this:

rank = doc.popularity * 10 + search.score

This seems to produce pretty reasonable results.

But there's more to the "matchness" too. Elasticsearch has its own API for defining boosting and the way we apply is:

match phrase in the title: Boost = 10.0
match phrase in the body: Boost = 5.0
match in title: Boost = 2.0
match in body: Boost = 1.0

This is then applied on top of whatever else Elasticsearch does such as "Term Frequency" and "Inverse Document Frequency" (tf and if). This article is a helpful introduction.

We're most likely not done with this. There's probably a lot more we can do to tune this myriad of knobs and sliders to get the best possible ranking of documents that match.

Web UI

The last piece of the puzzle is how we display all of this to the user. The way it works is that developer.mozilla.org/$locale/search returns a static page that is blank. As soon as the page has loaded, it lazy-loads JavaScript that can actually issue the XHR request to get and display search results. The code looks something like this:


function SearchResults() {
  const [searchParams] = useSearchParams();
  const sp = createSearchParams(searchParams);
  // add defaults and stuff here
  const fetchURL = `/api/v1/search?${sp.toString()}`;

  const { data, error } = useSWR(
    fetchURL,
    async (url) => {
      const response = await fetch(URL);
      // various checks on the response.statusCode here
      return await response.json();
    }
  );

  // render 'data' or 'error' accordingly here

A lot of interesting details are omitted from this code snippet. You have to check it out for yourself to get a more up-to-date insight into how it actually works. But basically, the window.location (and pushState) query string drives the fetch() call and then all the component has to do is display the search results with some highlighting.

The /api/v1/search endpoint also runs a suggestion query as part of the main search query. This extracts out interest alternative search queries. These are filtered and scored and we issue "sub-queries" just to get a count for each. Now we can do one of those "Did you mean...". For example: search for intersections.

In conclusion

There are a lot of interesting, important, and careful details that are glossed over here in this blog post. It's a constantly evolving system and we're constantly trying to improve and perfect the system in a way that it fits what users expect.

A lot of people reach MDN via a Google search (e.g. mdn array foreach) but despite that, nearly 5% of all traffic on MDN is the site-search functionality. The /$locale/search?... endpoint is the most frequently viewed page of all of MDN. And having a good search engine that's reliable is nevertheless important. By owning and controlling the whole pipeline allows us to do specific things that are unique to MDN that other websites don't need. For example, we index a lot of raw HTML (e.g. <video>) and we have code snippets that needs to be searchable.

Hopefully, the MDN site-search will elevate from being known to be very limited to something now that can genuinely help people get to the exact page better than Google can. Yes, it's worth aiming high!

Fastest way to turn HTML into text in Python

January 8, 2021
4 comments Python

tl;dr; selectolax is best for stripping HTML down to plain text.

The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask, yes I know Elasticsearch has a html_strip text filter but it's not what I want/need to use in this context).
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?

PyQuery


from pyquery import PyQuery as pq

text = pq(html).text()

selectolax


from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

regular expression


import re

regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

Results

I wrote a script that iterated through 10,000 files that contains HTML snippets. Note! The snippets aren't complete <html> documents (with a <head> and <body> etc) Just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

I've run it a bunch of times. The results are pretty stable.

Point is: selectolax is ~7 times faster than PyQuery

Regex? Really?

No, I don't think I want to use that. It makes me nervous without even attempting to dig up some examples where it goes wrong. It might work just fine for the most basic blobs of HTML. Actually, if the HTML is <p>Foo & Bar</p>, I expect the plain text transformation should be Foo & Bar, not Foo & Bar.

More pressing, both PyQuery and selectolax supports something very specific but important to my use case. I need to remove certain tags (and its content) before I proceed. For example:


<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

That can never be done with a regex.

Version 2.0

So my requirement will probably change but basically, I want to delete certain tags. E.g. <div class="warning"> and <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery


from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax


from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark for 10,000 of these are the new results:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of ~6.

Conclusion

Regular expressions are fast but weak in power. Makes sense.

This selectolax is very impressive.
I got the inspiration from this blog post which sets out to do something very similar to what I'm doing.

I hope this helps someone. Thank you Artem Golubin of selectolax and @lexborisov for Modest which selectolax is built upon.