Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under thecategories: JavaScript, Python. Clear filter

How to close a HTTP GET request in Python before the end

Python

Does you server barf if your clients close the connection before it's fully downloaded? Well, there's an easy way to find out. You can use this Python script:

import sys
import requests

url = sys.argv[1]
assert '://' in url, url
r = requests.get(url, stream=True)
if r.encoding is None:
    r.encoding = 'utf-8'
for chunk in r.iter_content(1024, decode_unicode=True):
    break

I use the xh CLI tool a lot. It's like curl but better in some things. By default, if you use --headers it will make a regular GET request but close the connection as soon as it has gotten all the headers. E.g.

ā–¶ xh --headers https://www.peterbe.com
HTTP/2.0 200 OK
cache-control: public,max-age=3600
content-type: text/html; charset=utf-8
date: Wed, 30 Mar 2022 12:37:09 GMT
etag: "3f336-Rohm58s5+atf5Qvr04kmrx44iFs"
server: keycdn-engine
strict-transport-security: max-age=63072000; includeSubdomains; preload
vary: Accept-Encoding
x-cache: HIT
x-content-type-options: nosniff
x-edge-location: usat
x-frame-options: SAMEORIGIN
x-middleware-cache: hit
x-powered-by: Express
x-shield: active
x-xss-protection: 1; mode=block

That's not be confused with doing HEAD like curl -I ....

So either with xh or the Python script above, you can get that same effect. It's a useful trick when you want to make sure your (async) server doesn't attempt to do weird stuff with the "Response" object after the connection has closed.

Please post a comment if you have thoughts or questions.

Introducing docsQL

Web development, GitHub, JavaScript

https://github.com/peterbe/docsql

tl;dr; docsQL is a web app for analyzing lots of Markdown content files with SQL queries.

Demo

Sample instance based on MDN's open source content.

Screenshots

Background/Introduction

When I worked on the code for MDN in 2019-2021 I often found that I needed to understand the content better to debug or test or just find a sample page that uses some feature. I ended up writing a lot of one-off Python scripts that would traverse the repository files just to do some quick lookup that was too complex for grep. Eventually, I built a prototype called "Traits DB" which was powered by an in-browser SQL engine called alasql. Then in 2021, I joined GitHub to work on GitHub Docs and here there are lots of Markdown files too that trigger different features based on various front-matter keys.

docsQL does two things:

  1. Analyze lots of .md files into a docs.json file which can be queried
  2. A static single-page-app for executing SQL against this database file

Plugins

The analyzing portion has a killer feature in that you can write your own plugins tailored specifically to your project. Your project might use some quirks that are unique. In GitHub Docs, for example, we use something called "LiquidJS" which is like a pre-Markdown processing to do things like versioning. So I can write a custom JavaScript plugin that extends data you get from reading in the front-matter.

Here's an example plugin:

const regex = /šŸ’©/g;
export default function countCocoIceMentions({ data, content }) {
  const inTitle = (data.title.match(regex) || []).length;
  const inBody = (content.match(regex) || []).length;
  return {
    chocolateIcecreamMentions: inTitle + inBody,
  };
}

Now, if you add that to your project, you'll be able to run:

SELECT title, chocolateIcecreamMentions FROM ? 
WHERE chocolateIcecreamMentions > 0 
ORDER BY 2 DESC LIMIT 15

How you're supposed to use it

It's up to you. One important fact to keep in mind is that not everyone speaks SQL fluently. And even if you're somewhat confident with SQL, it might not be obvious how this particular engine works or what the fields are. (Mind you, there's a "Help" which shows you all fields and a collection of sample queries).
But it's really intuitive to extend an already written SQL query. So if someone shares their query, it's easy to just extend it. For example, your colleague might share a URL with an SQL query in the query string, but you want to change the sort order so you just edit DESC for ASC.

I would recommend that any team that has a project with a bunch of Markdown files, add docsql as a dependency somewhere, have it build with your directory of Markdown files, and then publish the docsql/out/ directory as a static web page which you can host on Netlify or GitHub Pages.
This way, your team gets a centralized place where team members can share URLs with each other that has queries in it. When someone shares one of these, they get added to your "Saved queries" and you can extend them from there to add to your own list.

Behind the scenes

The project is here: github.com/peterbe/docsql and it's MIT licensed. The analyzing part is all Node. It's a CLI that is able to dynamically import other .mjs files based on scanning the directory at runtime.

The front-end is a NextJS static build which uses Mantine for the React UI components.

You can install it npx like this:

npx docsql /path/to/my/markdown/files

But if you want to control it a bit better you can simply add it to your own Node project with: npm save docsql or yarn add docsql.

Let's make it better

First of all, it's a very new project. My initial goal was to get the basics working. A lot of edges have been left rough. Especially in areas of installation, performance, and SQL editor. Please come and help out if you see something. In particular, if you tried to set it up but found it hard, we can work together to either improve the documentation to fix some scripts that would help the next person.

For feature requests and bug reports use: https://github.com/peterbe/docsql/issues/new
Or just comment here on the blog post.

Please post a comment if you have thoughts or questions.

Make your NextJS site 10-100x faster with Express caching

React, Node, Nginx, JavaScript

https://github.com/peterbe/next-peterbecom/blob/main/middleware/render-caching.mjs

UPDATE: Feb 21, 2022: The original blog post didn't mention the caching of custom headers. So warm cache hits would lose Cache-Control from the cold cache misses. Code updated below.

I know I know. The title sounds ridiculous. But it's not untrue. I managed to make my NextJS 20x faster by allowing the Express server, which handles NextJS, to cache the output in memory. And cache invalidation is not a problem.

Layers

My personal blog is a stack of layers:

KeyCDN --> Nginx (on my server) -> Express (same server) -> NextJS (inside Express)

And inside the NextJS code, to get the actual data, it uses HTTP to talk to a local Django server to get JSON based on data stored in a PostgreSQL database.

The problems I have are as follows:

  • The CDN sometimes asks for the same URL more than once when in theory you'd think it should be cached by them for a week. And if the traffic is high, my backend might get a stamping herd of requests until the CDN has warmed up.
  • It's technically possible to bypass the CDN by going straight to the origin server.
  • NextJS is "slow" and the culprit is actually critters which computes the critical CSS inline and lazy-loads the rest.
  • Using Nginx to do in-memory caching (which is powerfully fast by the way) does not allow cache purging at all (unless you buy Nginx Plus)

I really like NextJS and it's a great developer experience. There are definitely many things I don't like about it, but that's more because my site isn't SPA'y enough to benefit from much of what NextJS has to offer. By the way, I blogged about rewriting my site in NextJS last year.

Quick detour about critters

If you're reading my blog right now in a desktop browser, right-click and view source and you'll find this:

<head>
  <style>
  *,:after,:before{box-sizing:inherit}html{box-sizing:border-box}inpu...
  ... about 19k of inline CSS...
  </style>
  <link rel="stylesheet" href="/_next/static/css/fdcd47c7ff7e10df.css" data-n-g="" media="print" onload="this.media='all'">
  <noscript><link rel="stylesheet" href="/_next/static/css/fdcd47c7ff7e10df.css"></noscript>  
  ...
</head>

It's great for web performance because a <link rel="stylesheet" href="css.css"> is a render-blocking thing and it makes the site feel slow on first load. I wish I didn't need this, but it comes from my lack of CSS styling skills to custom hand-code every bit of CSS and instead, I rely on a bloated CSS framework which comes as a massive kitchen sink.

To add critical CSS optimization in NextJS, you add:

experimental: { optimizeCss: true },

inside your next.config.js. Easy enough, but it slows down my site by a factor of ~80ms to ~230ms on my Intel Macbook per page rendered.
So see, if it wasn't for this need of critical CSS inlining, NextJS would be about ~80ms per page and that includes getting all the data via HTTP JSON for each page too.

Express caching middleware

My server.mjs looks like this (simplified):

import next from "next";

import renderCaching from "./middleware/render-caching.mjs";

const app = next({ dev });
const handle = app.getRequestHandler();

app
  .prepare()
  .then(() => {
    const server = express();

    // For Gzip and Brotli compression
    server.use(shrinkRay());

    server.use(renderCaching);

    server.use(handle);

    // Use the rollbar error handler to send exceptions to your rollbar account
    if (rollbar) server.use(rollbar.errorHandler());

    server.listen(port, (err) => {
      if (err) throw err;
      console.log(`> Ready on http://localhost:${port}`);
    });
  })

And the middleware/render-caching.mjs looks like this:

import express from "express";
import QuickLRU from "quick-lru";

const router = express.Router();

const cache = new QuickLRU({ maxSize: 1000 });

router.get("/*", async function renderCaching(req, res, next) {
  if (
    req.path.startsWith("/_next/image") ||
    req.path.startsWith("/_next/static") ||
    req.path.startsWith("/search")
  ) {
    return next();
  }

  const key = req.url;
  if (cache.has(key)) {
    res.setHeader("x-middleware-cache", "hit");
    const [body, headers] = cache.get(key);
    Object.entries(headers).forEach(([key, value]) => {
      if (key !== "x-middleware-cache") res.setHeader(key, value);
    });
    return res.status(200).send(body);
  } else {
    res.setHeader("x-middleware-cache", "miss");
  }

  const originalEndFunc = res.end.bind(res);
  res.end = function (body) {
    if (body && res.statusCode === 200) {
      cache.set(key, [body, res.getHeaders()]);
      // console.log(
      //   `HEAP AFTER CACHING ${(
      //     process.memoryUsage().heapUsed /
      //     1024 /
      //     1024
      //   ).toFixed(1)}MB`
      // );
    }
    return originalEndFunc(body);
  };

  next();
});

export default router;

It's far from perfect and I only just coded this yesterday afternoon. My server runs a single Node process so the max heap memory would theoretically be 1,000 x the average size of those response bodies. If you're worried about bloating your memory, just adjust the QuickLRU to something smaller.

Let's talk about your keys

In my basic version, I chose this cache key:

const key = req.url;

but that means that http://localhost:3000/foo?a=1 is different from http://localhost:3000/foo?b=2 which might be a mistake if you're certain that no rendering ever depends on a query string.

But this is totally up to you! For example, suppose that you know your site depends on the darkmode cookie, you can do something like this:

const key = `${req.path} ${req.cookies['darkmode']==='dark'} ${rec.headers['accept-language']}`

Or,

const key = req.path.startsWith('/search') ? req.url : req.path

Purging

As soon as I launched this code, I watched the log files, and voila!:

::ffff:127.0.0.1 [18/Feb/2022:12:59:36 +0000] GET /about HTTP/1.1 200 - - 422.356 ms
::ffff:127.0.0.1 [18/Feb/2022:12:59:43 +0000] GET /about HTTP/1.1 200 - - 1.133 ms

Cool. It works. But the problem with a simple LRU cache is that it's sticky. And it's stored inside a running process's memory. How is the Express server middleware supposed to know that the content has changed and needs a cache purge? It doesn't. It can't know. The only one that knows is my Django server which accepts the various write operations that I know are reasons to purge the cache. For example, if I approve a blog post comment or an edit to the page, it triggers the following (simplified) Python code:

import requests

def cache_purge(url):
    if settings.PURGE_URL:
        print(requests.get(settings.PURGE_URL, json={
           pathnames: [url]
        }, headers={
           "Authorization": f"Bearer {settings.PURGE_SECRET}"
        })

    if settings.KEYCDN_API_KEY:
        api = keycdn.Api(settings.KEYCDN_API_KEY)
        print(api.delete(
            f"zones/purgeurl/{settings.KEYCDN_ZONE_ID}.json", 
            {"urls": [url]}
        ))    

Now, let's go back to the simplified middleware/render-caching.mjs and look at how we can purge from the LRU over HTTP POST:

const cache = new QuickLRU({ maxSize: 1000 })

router.get("/*", async function renderCaching(req, res, next) {
// ... Same as above
});


router.post("/__purge__", async function purgeCache(req, res, next) {
  const { body } = req;
  const { pathnames } = body;
  try {
    validatePathnames(pathnames)
  } catch (err) {
    return res.status(400).send(err.toString());
  }

  const bearer = req.headers.authorization;
  const token = bearer.replace("Bearer", "").trim();
  if (token !== PURGE_SECRET) {
    return res.status(403).send("Forbidden");
  }

  const purged = [];

  for (const pathname of pathnames) {
    for (const key of cache.keys()) {
      if (
        key === pathname ||
        (key.startsWith("/_next/data/") && key.includes(`${pathname}.json`))
      ) {
        cache.delete(key);
        purged.push(key);
      }
    }
  }
  res.json({ purged });
});

What's cool about that is that it can purge both the regular HTML URL and it can also purge those _next/data/ URLs. Because when NextJS can hijack the <a> click, it can just request the data in JSON form and use existing React components to re-render the page with the different data. So, in a sense, GET /_next/data/RzG7kh1I6ZEmOAPWpdA7g/en/plog/nextjs-faster-with-express-caching.json?oid=nextjs-faster-with-express-caching is the same as GET /plog/nextjs-faster-with-express-caching because of how NextJS works. But in terms of content, they're the same. But worth pointing out that the same piece of content can be represented in different URLs.

Another thing to point out is that this caching is specifically about individual pages. In my blog, for example, the homepage is a mix of the 10 latest entries. But I know this within my Django server so when a particular blog post has been updated, for some reason, I actually send out a bunch of different URLs to the purge where I know its content will be included. It's not perfect but it works pretty well.

Conclusion

The hardest part about caching is cache invalidation. It's usually the inner core of a crux. Sometimes, you're so desperate to survive a stampeding herd problem that you don't care about cache invalidation but as a compromise, you just set the caching time-to-live short.

But I think the most important tenant of good caching is: have full control over it. I.e. don't take it lightly. Build something where you can fully understand and change how it works exactly to your specific business needs.

This idea of letting Express cache responses in memory isn't new but I didn't find any decent third-party solution on NPMJS that I liked or felt fully comfortable with. And I needed to tailor exactly to my specific setup.

Go forth and try it out on your own site! Not all sites or apps need this at all, but if you do, I hope I have inspired a foundation of a solution.

Please post a comment if you have thoughts or questions.

My site's now NextJS - And I (almost) regret it already

React, Django, Node, JavaScript

My personal blog was a regular Django website with jQuery (later switched to Cash) for dynamic bits. In December 2021 I rewrote it in NextJS. It was a fun journey and NextJS is great but it's really not without some regrets.

Some flashpoints for note and comparison:

React SSR is awesome

The way infinitely nested comments are rendered is isomorphic now. Before I had to code it once as a Jinja2 template thing and once as a Cash (a fork of jQuery) thing. That's the nice and the promise of JavaScript React and server-side rendering.

JS bloat

The total JS payload is now ~111KB in 16 files. It used to be ~36KB in 7 files. :(

Before

Before

After

After

Data still comes from Django

Like any website, the web pages are made up from A) getting the raw data from a database, B) rendering that data in HTML.
I didn't want to rewrite all the database queries in Node (inside getServerSideProps).

What I did was I moved all the data gathering Django code and put them under a /api/v1/ prefix publishing simple JSON blobs. Then this is exposed on 127.0.0.1:3000 which the Node server fetches. And I wired up that that API endpoint so I can debug it via the web too. E.g. /api/v1/plog/sort-a-javascript-array-by-some-boolean-operation

Now, all I have to do is write some TypeScript interfaces that hopefully match the JSON that comes from Django. For example, here's the getServerSideProps code for getting the data to this page:

const url = `${API_BASE}/api/v1/plog/`;
const response = await fetch(url);
if (!response.ok) {
  throw new Error(`${response.status} on ${url}`);
}
const data: ServerData = await response.json();
const { groups } = data;

return {
  props: {
    groups,
  },
};

I like this pattern! Yes, there are overheads and Node could talk directly to PostgreSQL but the upside is decoupling. And with good outside caching, performance never matters.

Server + CDN > static site generation

I considered full-blown static generation, but it's not an option. My little blog only has about 1,400 blog posts but you can also filter by tags and combinations of tags and pagination of combinations of tags. E.g. /oc-JavaScript/oc-Python/p3 So the total number of pages is probably in the tens of thousands.

So, server-side rendering it is. To accomplish that I set up a very simple Express server. It proxies some stuff over to the Django server (e.g. /rss.xml) and then lets NextJS handle the rest.

import next from "next";
import express from "express";

const app = next();
const handle = app.getRequestHandler();

app
  .prepare()
  .then(() => {
    const server = express();

    server.use(handle);

    server.listen(port, (err) => {
      if (err) throw err;
      console.log(`> Ready on http://localhost:${port}`);
    });
  })

Now, my site is behind a CDN. And technically, it's behind Nginx too where I do some proxy_pass in-memory caching as a second line of defense.
Requests come in like this:

  1. from user to CDN
  2. from CDN to Nginx
  3. from Nginx to Express (proxy_pass)
  4. from Express to next().getRequestHandler()

And I set Cache-Control in res.setHeader("Cache-Control", "public,max-age=86400") from within the getServerSideProps functions in the src/pages/**/*.tsx files. And once that's set, the response will be cached both in Nginx and in the CDN.

Any caching is tricky when you need to do revalidation. Especially when you roll out a new central feature in the core bundle. But I quite like this pattern of a slow-rolling upgrade as individual pages eventually expire throughout the day.

This is a nasty bug with this and I don't yet know how to solve it. Client-side navigation is dependent of hashing. So loading this page, when done with client-side navigation, becomes /_next/data/2ps5rE-K6E39AoF4G6G-0/en/plog.json (no, I don't know how that hashed URL is determined). But if a new deployment happens, the new URL becomes /_next/data/UhK9ANa6t5p5oFg3LZ5dy/en/plog.json so you end up with a 404 because you started on a page based on an old JavaScript bundle, that is now invalid.

Thankfully, NextJS handles it quite gracefully by throwing an error on the 404 so it proceeds with a regular link redirect which takes you away from the old page.

Client-side navigation still sucks. Kinda.

Next has a built-in <Link> component that you use like this:

import Link from "next/link";

...

<Link href={"/plog/" + post.oid}>
  {post.title}
</Link>

Now, clicking any of those links will automatically enable client-side routing. Thankfully, it takes care of preloading the necessary JavaScript (and CSS) simply by hovering over the link, so that when you eventually click it just needs to do an XHR request to get the JSON necessary to be able to render the page within the loaded app (and then do the pushState stuff to change the URL accordingly).

It sounds good in theory but it kinda sucks because unless you have a really good Internet connection (or could be you hit upon a CDN-cold URL), nothing happens when you click. This isn't NextJS's fault, but I wonder if it's actually horribly for users.

Yes, it sucks that a user clicks something but nothing happens. (I think it would be better if it was a button-press and not a link because buttons feel more like an app whereas links have deeply ingrained UX expectations). But most of the time, it's honestly very fast and when it works it's a nice experience. It's a great piece of functionality for more app'y sites, but less good for websites whose most of the traffic comes from direct links or Google searches.

NextJS has built-in critical CSS optimization

Critical inline CSS is critical (pun intended) for web performance. Especially on my poor site where I depend on a bloated (and now ancient) CSS framework called Semantic-UI. Without inline CSS, the minified CSS file would become over 200KB.

In NextJS, to enable inline critical CSS loading you just need to add this to your next.config.js:

    experimental: { optimizeCss: true },

and you have to add critters to your package.json. I've found some bugs with it but nothing I can't work around.

Conclusion and what's next

I'm very familiar and experienced with React but NextJS is new to me. I've managed to miss it all these years. Until now. So there's still a lot to learn. With other frameworks, I've always been comfortable that I don't actually understand how Webpack and Babel work (internally) but at least I understood when and how I was calling/depending on it. Now, with NextJS there's a lot of abstracted magic that I don't quite understand. It's hard to let go of that. It's hard to get powerful tools that are complex and created by large groups of people and understand it all too. If you're desperate to understand exactly how something works, you inevitably have to scale back the amount of stuff you're leveraging. (Note, it might be different if it's absolute core to what you do for work and hack on for 8 hours a day)

The JavaScript bundles in NextJS lazy-load quite decently but it's definitely more bloat than it needs to be. It's up to me to fix it, partially, because much of the JS code on my site is for things that technically can wait such as the interactive commenting form and the auto-complete search.

But here's the rub; my site is not an app. Most traffic comes from people doing a Google search, clicking on my page, and then bugger off. It's quite static that way and who am I to assume that they'll stay and click around and reuse all that loaded JavaScript code.

With that said; I'm going to start an experiment to rewrite the site again in Remix.

Please post a comment if you have thoughts or questions.

Sort a JavaScript array by some boolean operation

JavaScript

UPDATE Feb 4, 2022 Yes, as commenter Matthias pointed out, you can let JavaScript implicitly make the conversion of boolean minus boolean to end number a number of -1, 0, or 1.

Original but edited blog post below...


Imagine you have an array like this:

const items = [
  { num: 'one', labels: [] },
  { num: 'two', labels: ['foo'] },
  { num: 'three', labels: ['bar'] },
  { num: 'four', labels: ['foo'] },
  { num: 'five', labels: [] },
];

What you want, is to sort them in a way that all those entries that have a label foo come first, but you don't want to "disturb" the existing order. Essentially you want this to be the end result:

{ num: 'two', labels: ['foo'] },
{ num: 'four', labels: ['foo'] },

{ num: 'one', labels: [] },
{ num: 'three', labels: ['bar'] },
{ num: 'five', labels: [] },

Here's a way to do that:

items.sort(
  (itemA, itemB) =>
    itemB.labels.includes('foo') - itemA.labels.includes('foo')
);
console.log(items);

And the outcome is:

[
  { num: 'two', labels: [ 'foo' ] },
  { num: 'four', labels: [ 'foo' ] },
  { num: 'one', labels: [] },
  { num: 'three', labels: [ 'bar' ] },
  { num: 'five', labels: [] }
]

The simple trick is to turn then test operation into a number (0 or 1) and you can do that with Number.

Boolean minus boolean is operator overloaded to become an integer. From the Node repl...

> true - false
1
> false - true
-1
> false - false
0
> true - true
0

Please post a comment if you have thoughts or questions.

Brotli compression quality comparison in the real world

Node, JavaScript

At work, we use Brotli (using the Node builtin zlib) to compress these large .json files to .json.br files. When using zlib.brotliCompress you can set options to override the quality number. Here's an example of it at quality 6:

import { promisify } from 'util'
import zlib from 'zlib'
const brotliCompress = promisify(zlib.brotliCompress)

const options = {
  params: {
    [zlib.constants.BROTLI_PARAM_MODE]: zlib.constants.BROTLI_MODE_TEXT,
    [zlib.constants.BROTLI_PARAM_QUALITY]: 6,
  },
}

export async function compress(data) {
  return brotliCompress(data, options)
}

But what if you mess with that number. Surely, the files will become smaller, but at what cost? Well, I wrote a Node script that measured how long it would take to compress 6 large (~25MB each) .json file synchronously. Then, I put them into a Google spreadsheet and voila:

Size

Total size per level

Time

Total seconds per level

Miles away from rocket science but I thought it was cute to visualize as a way of understanding the quality option.

Please post a comment if you have thoughts or questions.