Today I stumbled across a neat CLI for benchmarking and comparing the speed of other CLI commands: hyperfine, by David Peter (@sharkdp).
It's a great tool in your arsenal for quick benchmarks in the terminal.
It's written in Rust and is easily installed with brew install hyperfine. For example, let's compare a couple of different commands for compressing a file into a new compressed file. I know it's comparing apples and oranges, but it's just an example:
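In this case, the invocation was essentially this (the same command appears again further down, with the Markdown export added):

▶ hyperfine "apack log.log.apack.gz log.log" "gzip -k log.log" "zstd log.log" "brotli -3 log.log" --prepare="rm -fr log.log.*"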
It basically executes the following commands over and over and then compares how long each one took on average:
apack log.log.apack.gz log.log
gzip -k log.log
zstd log.log
brotli -3 log.log
If you're curious about the ~results~ apples vs oranges, the final result is:
▶ ls -lSh log.log*
-rw-r--r-- 1 peterbe staff 25M Jul 3 10:39 log.log
-rw-r--r-- 1 peterbe staff 2.4M Jul 5 22:00 log.log.apack.gz
-rw-r--r-- 1 peterbe staff 2.4M Jul 3 10:39 log.log.gz
-rw-r--r-- 1 peterbe staff 2.2M Jul 3 10:39 log.log.zst
-rw-r--r-- 1 peterbe staff 2.1M Jul 3 10:39 log.log.br
The point is that you type hyperfine followed by each command in quotation marks. The --prepare command is run before each timing run, and you can also use --cleanup="{cleanup command here}".
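For example, a hypothetical variant of the gzip benchmark that also cleans up the compressed file after its runs could look like this:

▶ hyperfine --prepare="rm -f log.log.gz" --cleanup="rm -f log.log.gz" "gzip -k log.log"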
It's versatile, so it doesn't have to be completely different commands; it can be, for example, hyperfine "python optimization1.py" "python optimization2.py" to compare two Python scripts.
🎵 You can also export the output to a Markdown file. Here, I used:
▶ hyperfine "apack log.log.apack.gz log.log" "gzip -k log.log" "zstd log.log" "brotli -3 log.log" --prepare="rm -fr log.log.*" --export-markdown log.compress.md
▶ cat log.compress.md | pbcopy
and it becomes this:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| apack log.log.apack.gz log.log | 291.9 ± 7.2 | 283.8 | 304.1 | 4.90 ± 0.19 |
| gzip -k log.log | 240.4 ± 7.3 | 232.2 | 256.5 | 4.03 ± 0.18 |
| zstd log.log | 59.6 ± 1.8 | 55.8 | 65.5 | 1.00 |
| brotli -3 log.log | 122.8 ± 4.1 | 117.3 | 132.4 | 2.06 ± 0.09 |
tl;dr
- name: Only if auto-merge is enabled
  if: ${{ github.event.pull_request.auto_merge }}
  run: echo "Auto-merge IS ENABLED"

- name: Only if auto-merge is NOT enabled
  if: ${{ !github.event.pull_request.auto_merge }}
  run: echo "Auto-merge is NOT enabled"
The use case that I needed was that I have a workflow that does a bunch of things that aren't really critical to test the PR, but they also take a long time. In particular, every pull request deploys a "preview environment" so you get a "staging" site for each pull request. Well, if you know with confidence that you're not going to be clicking around on that preview/staging site, why bother deploying it (again)?
Also, a lot of PRs get the "Auto-merge" enabled because whoever pressed that button knows that as long as it builds OK, it's ready to merge in.
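Putting those two together, here's a minimal sketch (the workflow name, job name, and deploy script are made up for illustration) of skipping the preview deployment whenever auto-merge is enabled:

name: Preview environment

on:
  pull_request:

jobs:
  deploy-preview:
    # Skip the slow preview deployment when auto-merge is enabled on the PR
    if: ${{ !github.event.pull_request.auto_merge }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy preview environment
        run: ./scripts/deploy-preview.sh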
What's cool about the if: statements above is that they will work in all of these cases too:
on:
  workflow_dispatch:
  pull_request:
  push:
    branches:
      - main
I.e. if this runs because it was a push to main, the line ${{ !github.event.pull_request.auto_merge }} will resolve to truthy. Same if you trigger the workflow manually via workflow_dispatch.
Auto-merge is a fantastic GitHub feature. You first need to set up some branch protections and then, as soon as you've created the PR, you can press the "Enable auto-merge (squash)" button. It will ("Squash and merge") merge the PR as soon as all branch protection checks have succeeded. Neat.
But what if you have a workflow that is made up of half critical and half not-so-important stuff? In particular, what if there's stuff in the workflow that is really slow and you don't want to wait for it? One example is that you might have a build-and-deploy workflow where you've decided that the "build" part of it is a required check, but the (slow) deployment is just a nice-to-have. Here's an example of that:
name: Build and Deploy stuff

on:
  workflow_dispatch:
  pull_request:

permissions:
  contents: read

jobs:
  build-stuff:
    runs-on: ubuntu-latest
    steps:
      - name: Slight delay
        run: sleep 5

  deploy-stuff:
    needs: build-stuff
    runs-on: ubuntu-latest
    steps:
      - name: Do something
        run: sleep 26
It's a bit artificial, but perhaps you can see beyond that. What you can do is set up a required status check, as a branch protection, just for the build-stuff job.
Note how the workflow is made up of build-stuff and deploy-stuff, where the latter depends on the first. Now set up branch protection purely based on build-stuff. This option should appear as you start typing buil in the "Status checks that are required" section of Branch protections.
Now, when the PR is created it immediately starts working on that build-stuff job. While that's running, you press the "Enable auto-merge (squash)" button.
What will happen is that as soon as the build-stuff job (technically, the full name becomes "Build and Deploy stuff / build-stuff") goes green, the PR is auto-merged. But the next (dependent) job, deploy-stuff, now starts, so even though the PR is merged you still have an ongoing workflow job running. Note the little orange dot (instead of the green checkmark).
It's quite an advanced pattern and perhaps you don't have the use case yet, but it's good to know it's possible. Our use case at work was that we use auto-merge a lot in automation, and our complete workflow depended on a step that is conditional and a bit slow. So we didn't want the auto-merge to be delayed by something that might be slow and might also turn out to not be necessary.
Imagine you have something like this in Django:
class MyModel(models.Model):
    last_name = models.CharField(max_length=255, blank=True)
    ...
The most basic sorting is either queryset.order_by('last_name') or queryset.order_by('-last_name'). But what if you want entries with a blank string last? And you want it to be case insensitive? Here's how you do it:
from django.db.models.functions import Lower, NullIf
from django.db.models import Value
if reverse:
    order_by = Lower("last_name").desc()
else:
    order_by = Lower(NullIf("last_name", Value("")), nulls_last=True)
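# Apply the ordering (assumed here; the original snippet goes straight
# to printing the results)
queryset = queryset.order_by(order_by)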
ALL = list(queryset.values_list("last_name", flat=True))
print("FIRST 5:", ALL[:5])
# Will print either...
# FIRST 5: ['Zuniga', 'Zukauskas', 'Zuccala', 'Zoller', 'ZM']
# or
# FIRST 5: ['A', 'aaa', 'Abrams', 'Abro', 'Absher']
print("LAST 5:", ALL[-5:])
# Will print...
# LAST 5: ['', '', '', '', '']
This is only tested with PostgreSQL but it works nicely.
If you're curious about what the SQL becomes, it's:
SELECT "main_contact"."last_name" FROM "main_contact"
ORDER BY LOWER(NULLIF("main_contact"."last_name", '')) ASC
or
SELECT "main_contact"."last_name" FROM "main_contact"
ORDER BY LOWER("main_contact"."last_name") DESC
Note that if your table column can be a string, an empty string, or null, the reverse needs to be: Lower("last_name").desc(nulls_last=True).
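One way to handle a column that can be a real string, an empty string, or NULL in both directions is to normalize empty strings to NULL and explicitly sort NULLs last either way. A sketch (untested, assuming the same queryset and reverse flag as above):

from django.db.models import Value
from django.db.models.functions import Lower, NullIf

# Treat empty strings as NULL, then always push NULLs to the end
expression = Lower(NullIf("last_name", Value("")))
if reverse:
    order_by = expression.desc(nulls_last=True)
else:
    order_by = expression.asc(nulls_last=True)
queryset = queryset.order_by(order_by)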
Does your server barf if your clients close the connection before the response is fully downloaded? Well, there's an easy way to find out. You can use this Python script:
import sys
import requests
url = sys.argv[1]
assert '://' in url, url
r = requests.get(url, stream=True)
if r.encoding is None:
    r.encoding = 'utf-8'
for chunk in r.iter_content(1024, decode_unicode=True):
    # Read only the first chunk, then stop, which closes the connection
    # before the response body has been fully downloaded
    break
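Save it as, say, close_early.py (the filename is arbitrary) and point it at a URL:

▶ python close_early.py https://www.peterbe.com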
I use the xh CLI tool a lot. It's like curl but better in some things. By default, if you use --headers it will make a regular GET request but close the connection as soon as it has gotten all the headers. E.g.
▶ xh --headers https://www.peterbe.com
HTTP/2.0 200 OK
cache-control: public,max-age=3600
content-type: text/html; charset=utf-8
date: Wed, 30 Mar 2022 12:37:09 GMT
etag: "3f336-Rohm58s5+atf5Qvr04kmrx44iFs"
server: keycdn-engine
strict-transport-security: max-age=63072000; includeSubdomains; preload
vary: Accept-Encoding
x-cache: HIT
x-content-type-options: nosniff
x-edge-location: usat
x-frame-options: SAMEORIGIN
x-middleware-cache: hit
x-powered-by: Express
x-shield: active
x-xss-protection: 1; mode=block
That's not to be confused with doing HEAD, like curl -I ....
So either with xh or the Python script above, you can get that same effect. It's a useful trick when you want to make sure your (async) server doesn't attempt to do weird stuff with the "Response" object after the connection has closed.
tl;dr; docsQL is a web app for analyzing lots of Markdown content files with SQL queries.
Sample instance based on MDN's open source content.
When I worked on the code for MDN in 2019-2021 I often found that I needed to understand the content better to debug or test or just find a sample page that uses some feature. I ended up writing a lot of one-off Python scripts that would traverse the repository files just to do some quick lookup that was too complex for grep. Eventually, I built a prototype called "Traits DB" which was powered by an in-browser SQL engine called alasql. Then in 2021, I joined GitHub to work on GitHub Docs, and here there are lots of Markdown files too that trigger different features based on various front-matter keys.
docsQL does two things: it gathers your .md files into a docs.json file, and it gives you a web app in which that file can be queried with SQL.

The analyzing portion has a killer feature in that you can write your own plugins tailored specifically to your project. Your project might have quirks that are unique to it. In GitHub Docs, for example, we use something called "LiquidJS", which is a pre-Markdown processing step used for things like versioning. So I can write a custom JavaScript plugin that extends the data you get from reading in the front-matter.
Here's an example plugin:
const regex = /💩/g;
export default function countCocoIceMentions({ data, content }) {
  const inTitle = (data.title.match(regex) || []).length;
  const inBody = (content.match(regex) || []).length;
  return {
    chocolateIcecreamMentions: inTitle + inBody,
  };
}
Now, if you add that to your project, you'll be able to run:
SELECT title, chocolateIcecreamMentions FROM ?
WHERE chocolateIcecreamMentions > 0
ORDER BY 2 DESC LIMIT 15
It's up to you. One important fact to keep in mind is that not everyone speaks SQL fluently. And even if you're somewhat confident with SQL, it might not be obvious how this particular engine works or what the fields are. (Mind you, there's a "Help" which shows you all fields and a collection of sample queries).
But it's really intuitive to extend an already written SQL query. So if someone shares their query, it's easy to just extend it. For example, your colleague might share a URL with an SQL query in the query string, but you want to change the sort order, so you just change DESC to ASC.
I would recommend that any team that has a project with a bunch of Markdown files add docsql as a dependency somewhere, have it build with your directory of Markdown files, and then publish the docsql/out/ directory as a static web page, which you can host on Netlify or GitHub Pages.
This way, your team gets a centralized place where team members can share URLs with each other that have queries in them. When someone shares one of these, it gets added to your "Saved queries" and you can extend it from there to add to your own list.
The project is here: github.com/peterbe/docsql and it's MIT licensed. The analyzing part is all Node. It's a CLI that is able to dynamically import other .mjs files based on scanning the directory at runtime.
The front-end is a NextJS static build which uses Mantine for the React UI components.
You can run it with npx like this:
npx docsql /path/to/my/markdown/files
But if you want to control it a bit better, you can simply add it to your own Node project with npm install docsql or yarn add docsql.
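Once it's a dependency, a package.json script could be as small as this sketch (the script name and the docs path are just placeholders):

{
  "scripts": {
    "docsql": "docsql docs/"
  }
}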
First of all, it's a very new project. My initial goal was to get the basics working. A lot of edges have been left rough, especially in the areas of installation, performance, and the SQL editor. Please come and help out if you see something. In particular, if you tried to set it up but found it hard, we can work together to either improve the documentation or fix some scripts that would help the next person.
For feature requests and bug reports use: https://github.com/peterbe/docsql/issues/new
Or just comment here on the blog post.