Peterbe.com

A blog and website by Peter Bengtsson

How much faster is Cheerio at parsing depending on xmlMode?

Node, JavaScript

Cheerio is a fantastic Node library for parsing HTML and then being able to manipulate and serialize it. But you can also just use it for parsing HTML and plucking out what you need. We use that to prepare the text that goes into our search index for our site. It basically works like this:

const body = await getBody('http://localhost:4002' + eachPage.path)
const $ = cheerio.load(body)
const title = $('h1').text()
const intro = $('p.intro').text()
...

But it hit me, can we speed that up? cheerio actually ships with two different parsers:

  1. parse5
  2. htmlparser2

One is faster and one is more strict.
But I wanted to see this in a real-world example.

So I made two runs where I used:

const $ = cheerio.load(body)

in one run, and:

const $ = cheerio.load(body, { xmlMode: true })

in another.

After having parsed 1,635 pages of HTML of various sizes the results are:

FILE: load.txt
MEAN:   13.19457640586797
MEDIAN: 10.5975

FILE: load-xmlmode.txt
MEAN:   3.9020372860635697
MEDIAN: 3.1020000000000003

So, using {xmlMode:true} leads to roughly a 3x speedup.

I think it pretty much confirms the original benchmark, but now I know based on a real application.

Please post a comment if you have thoughts or questions.

Programmatically control the matrix in a GitHub Action workflow

GitHub

If you've used GitHub Actions before you might be familiar with the matrix strategy. For example:

name: My workflow

jobs:
  build:
    strategy:
      matrix:
        version: [10, 12, 14, 16, 18]
    steps:
      - name: Set up Node ${{ matrix.node }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node }}
      ...

But what if you want that list of things in the matrix to be variable? For example, on rainy days you want it to be [10, 12, 14] and on sunny days you want it to be [14, 16, 18]. Or, more seriously, what if you want it to depend on how the workflow is started?

Let's explain this with a scoped example

You can make a workflow run on a schedule, on pull requests, on pushes, on manual "Run workflow", or as a result on some other workflow finishing.

First, let's set up some sample on directives:

name: My workflow

on:
  workflow_dispatch:
  schedule:
    - cron: '*/5 * * * *'
  workflow_run:
    workflows: ['Build and Deploy stuff']
    types:
      - completed

The workflow_dispatch makes it so that a button like this appears:

Run workflow

The schedule, in this example, means "At every 5th minute"

And workflow_run, in this example, means that it waits for another workflow, in the same repo, with name: 'Build and Deploy stuff' has finished (but not necessarily successfully)

Let's define some choice business logic

For the sake of the demo, let's say this is the rule:

  1. If the workflow runs because of the schedule, you want the matrix to be [16, 18].
  2. If the workflow runs because of the "Run workflow" button press, you want the matrix to be [18].
  3. If the workflow runs because of the Build and Deploy stuff workflow has successfully finished, you want the matrix to be [10, 12, 14, 16, 18].

It's arbitrary but it could be a lot more complex than this.

What's also important to appreciate is that you could use individual steps that look something like this:

  - steps:
     - name: Only if started on a workflow_dispatch
        if: ${{ github.event_name == 'workflow_dispatch' }}
        run: echo "yes it was run because of a workflow_dispatch"

But the rest of the workflow is realistically a lot more complex with many steps and you don't want to have to sprinkle the line if: ${{ github.event_name == 'workflow_dispatch' }} into every single step.

The solution to avoiding repetition is to use a job that depends on another job. We'll have a job that figures out the array for the matrix and another job that uses that.

Let's write the business logic in JavaScript

First we inject a job that looks like this:

jobs:
  matrix_maker:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set-matrix.outputs.result }}
    steps:
      - uses: actions/github-script@v6
        id: set-matrix
        with:
          script: |
            if (context.eventName === "workflow_dispatch") {
              return [18]
            }
            if (context.eventName === "schedule") {
              return [16, 18]
            }
            if (context.eventName === "workflow_run") {
              if (context.payload.workflow_run.conclusion === "success") {
                return [10, 12, 14, 16, 18]
              }
              throw new Error(`It was a workflow_run but not success ('${context.payload.workflow_run.conclusion}')`)
            }
            throw new Error("Unable to find a reason")

      - name: Debug output
        run: echo "${{ steps.set-matrix.outputs.result }}"

Now we can write the "meat" of the workflow that uses this output:

  build:
    needs: matrix_maker
    strategy:
      matrix:
        version: ${{ fromJSON(needs.matrix_maker.outputs.matrix) }}
    steps:
      - name: Set up Node ${{ matrix.version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.version }}

Combined, the entire thing can look like this:

name: My workflow

on:
  workflow_dispatch:
  schedule:
    - cron: '*/5 * * * *'
  workflow_run:
    workflows: ['Build and Deploy stuff']
    types:
      - completed

jobs:
  matrix_maker:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set-matrix.outputs.result }}
    steps:
      - uses: actions/github-script@v6
        id: set-matrix
        with:
          script: |
            if (context.eventName === "workflow_dispatch") {
              return [18]
            }
            if (context.eventName === "schedule") {
              return [16, 18]
            }
            if (context.eventName === "workflow_run") {
              if (context.payload.workflow_run.conclusion === "success") {
                return [10, 12, 14, 16, 18]
              }
              throw new Error(`It was a workflow_run but not success ('${context.payload.workflow_run.conclusion}')`)
            }
            throw new Error("Unable to find a reason")

      - name: Debug output
        run: echo "${{ steps.set-matrix.outputs.result }}"

  build:
    needs: matrix_maker
    strategy:
      matrix:
        version: ${{ fromJSON(needs.matrix_maker.outputs.matrix) }}
    steps:
      - name: Set up Node ${{ matrix.version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.version }}

Conclusion

I've extrapolated this demo from a more complex one at work. (this is my defense for typos and why it might fail if you verbatim copy-n-paste this). The bare bones are there for you to build on.

In this demo, I've used actions/github-script with JavaScript, because it's convenient and you don't need do to things like actions/checkout and npm ci if you want this to be a standalone Node script. Hopefully you can see that this is just a start and the sky's the limit.

Thanks to fellow GitHub Hubber @joshmgross for the tips and help!

Also, check out Tips and tricks to make you a GitHub Actions power-user

Please post a comment if you have thoughts or questions.

First impressions trying out Rome to format/lint my TypeScript and JavaScript

Node, JavaScript

Rome is a new contender to compete with Prettier and eslint, combined. It's fast and its suggestions are much easier to understand.

I have a project that uses .js, .ts, and .tsx files. At first, I thought, I'd just use rome to do formatting but the linter part was feeling nice as I was experimenting so I thought I'd kill two birds with one stone.

Things that worked well

It is fast

My little project only has 28 files, but time rome check lib scripts components *.ts consistently takes 0.08 seconds.

The CLI looks great

You get this nice prompt after running npx rome init the first time:

rome init

Suggestions just look great

Easy to understand and needs no explanation because the suggested fix tells a story that means it's immediately easy to understand what the warning is trying to say.

suggestion

It is smaller

If I run npx create-next-app@latest, say yes to Eslint, and then run npm I -D prettier, the node_modules becomes 275.3 MiB.
Whereas if I run npx create-next-app@latest, say no to Eslint, and then run npm I -D rome, the node_modules becomes 200.4 MiB.

Editing the rome.json's JSON schema works in VS Code

I don't know how this magically worked, but I'm guessing it just does when you install the Rome VS Code extension. Neat with autocomplete!

editing the rome.json file

Things that didn't work so well

Almost all things that I'm going to "complain" about is down to usability. I might look back at this in a year (or tomorrow!) and laugh at myself for being dim, but it nevertheless was part of my experience so it's worth pointing out.

Lint, check, or format?

It's confusing what is what. If lint means checking without modifying, what is check then? I'm guessing rome format means run the lint but with permission to edit my files.

What is rome format compared to rome check --apply then??

I guess rome check --apply doesn't just complain but actually applies the things it spots. So what is rome check --apply-suggested?? (if you're reading this and feel eager to educate me with a comment, please do, but I'm trying to point out that it's not user-friendly)

How do I specify wildcards?

Unfortunately, in this project, not all files are in one single directory (e.g. rome check src/ is not an option). How do I specify a wildcard expression?

▶ rome check *.ts
Checked 3 files in 942µs

Cool, but how do I do all .ts files throughout the project?

▶ rome check "**/*.ts"
**/*.ts internalError/io ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  ✖ No such file or directory (os error 2)


Checked 0 files in 66µs

Clearly, it's not this:

▶ rome check **/*.ts

...

The number of diagnostics exceeds the number allowed by Rome.
Diagnostics not shown: 1018.
Checked 2534 files in 1387ms
Skipped 1 files
Error: errors where emitted while running checks

...because bash will include all the files from node_modules/**/*.ts.

In the end, I ended up with this (in my package.json):

"scripts": {
    "code:lint": "rome check lib scripts components *.ts",
    ...

There's no documentation about how to ignore certain rules

Yes, I can contribute this back to the documentation, but today's not the day to do that.

It took me a long time to find out how to disable certain rules (in the rome.json file) and finally I landed on this:

{
  "linter": {
    "enabled": true,
    "rules": {
      "recommended": true,
      "style": {
        "recommended": true,
        "noImplicitBoolean": "off"
      },
      "a11y": {
        "useKeyWithClickEvents": "off",
        "useValidAnchor": "warn"
      }
    }
  }
}

Much better than having to write inline code comments with the source files themselves.

However, it's still not clear to me what "recommended": true means. Is it shorthand for listing all the default rules all set to true? If I remove that, are no rules activated?

The rome.json file is JSON

JSON is cool for many things, but writing comments is not one of them.

For example, I don't know what would be better, Yaml or Toml, but it would be nice to write something like:

"a11y": {
    # Disabled because of issue #1234
    # Consider putting this back in December after the refactor launch
    "useKeyWithClickEvents": "off",

Nextjs and rome needs to talk

When create-react-app first came onto the scene, the coolest thing was the zero-config webpack. But, if you remember, it also came with a really nice zero-config eslint configuration for React apps. It would even print warnings when the dev server was running. Now it's many years later and good linting config is something you depend/rely on in a framework. Like it or not, there are specific things in Nextjs that is exclusive to that framework. It's obviously not an easy people-problem to solve but it would be nice if Nextjs and rome could be best friends so you get all the good linting ideas from the code Nextjs framework but all done using rome instead.

Please post a comment if you have thoughts or questions.

How to count the most common lines in a file

Bash, MacOSX, Linux

tl;dr sort myfile.log | uniq -c | sort -n -r

I wanted to count recurring lines in a log file and started writing a complicated Python script but then wondered if I can just do it with bash basics.
And after some poking and experimenting I found a really simple one-liner that I'm going to try to remember for next time:

You can't argue with the nice results :)

▶ cat myfile.log
one
two
three
one
two
one
once
one

▶ sort myfile.log | uniq -c | sort -n -r
   4 one
   2 two
   1 three
   1 once

Please post a comment if you have thoughts or questions.

Find the largest node_modules directories with bash

Bash, MacOSX, Linux

tl;dr; fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr

It's very possible that there's a tool that does this, but if so please enlighten me.
The objective is to find which of all your various projects' node_modules directory is eating up the most disk space.
The challenge is that often you have nested node_modules within and they shouldn't be included.

The command uses fd which comes from brew install fd and it's a fast alternative to the built-in find. Definitely worth investing in if you like to live fast on the command line.
The other important command here is rg which comes from brew install ripgrep and is a fast alternative to built-in grep. Sure, I think one can use find and grep but that can be left as an exercise to the reader.

▶ fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr
1.1G    ./GROCER/groce/node_modules/
1.0G    ./SHOULDWATCH/youshouldwatch/node_modules/
826M    ./PETERBECOM/django-peterbecom/adminui/node_modules/
679M    ./JAVASCRIPT/wmr/node_modules/
546M    ./WORKON/workon-fire/node_modules/
539M    ./PETERBECOM/chiveproxy/node_modules/
506M    ./JAVASCRIPT/minimalcss-website/node_modules/
491M    ./WORKON/workon/node_modules/
457M    ./JAVASCRIPT/battleshits/node_modules/
445M    ./GITHUB/DOCS/docs-internal/node_modules/
431M    ./GITHUB/DOCS/docs/node_modules/
418M    ./PETERBECOM/preact-cli-peterbecom/node_modules/
418M    ./PETERBECOM/django-peterbecom/adminui0/node_modules/
399M    ./GITHUB/THEHUB/thehub/node_modules/
...

How it works:

  • fd -I -t d node_modules: Find all directories called node_modules but ignore any .gitignore directives in their parent directories.
  • rg -v 'node_modules/(\w|@)': Exclude all finds where the word node_modules/ is followed by a @ or a [a-z0-9] character.
  • xargs du -sh: For each line, run du -sh on it. That's like doing cd some/directory && du -sh, where du means "disk usage" and -s means total and -h means human-readable.
  • sort -hr: Sort by the first column as a "human numeric sort" meaning it understands that "1M" is more than "20K"

Now, if I want to free up some disk space, I can look through the list and if I recognize a project I almost never work on any more, I just send it to rm -fr.

Please post a comment if you have thoughts or questions.

Spot the JavaScript bug with recursion and incrementing

JavaScript

What will this print?

function doSomething(iterations = 0) {
  if (iterations < 10) {
    console.log("Let's do this again!")
    doSomething(iterations++)    
  }
}
doSomething()

The answer is it will print

Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
Let's do this again!
...forever...

The bug is the use of a "postfix increment" which is a bug I had in some production code (almost, it never shipped).

The solution is simple:

     console.log("Let's do this again!")
-    doSomething(iterations++)
+    doSomething(++iterations)    

That's called "prefix increment" which means it not only changes the variable but returns what the value became rather than what it was before increment.

The beautiful solution is actually the simplest solution:

     console.log("Let's do this again!")
-    doSomething(iterations++)
+    doSomething(iterations + 1)    

Now, you don't even mutate the value of the iterations variable but create a new one for the recursion call.

All in all, pretty simple mistake but it can easily happen. Particular if you feel inclined to look cool by using the spiffy ++ shorthand because it looks neater or something.

Please post a comment if you have thoughts or questions.