31 August 2019 0 comments Javascript
It started with this:
const fs = require('fs');
const path = require('path');

function walk(directory, filepaths = []) {
  const files = fs.readdirSync(directory);
  for (let filename of files) {
    const filepath = path.join(directory, filename);
    if (path.extname(filename) === '.md') {
      filepaths.push(filepath);
    } else if (fs.statSync(filepath).isDirectory()) {
      walk(filepath, filepaths);
    }
  }
  return filepaths;
}
And you use it like this:
const foundFiles = walk(someDirectoryOfMine);
console.log(foundFiles.length);
I thought, perhaps it's faster or better to use glob. So I installed that.
Then I found fast-glob, which sounds faster. You can use both in a synchronous way.
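For reference, here's roughly how the equivalent synchronous lookups might look with those two packages (a sketch; the pattern and options are my assumptions, not taken from the original benchmark code):

const glob = require('glob');
const fg = require('fast-glob');

// Recursively find all .md files with glob, synchronously...
const withGlob = glob.sync('**/*.md', { cwd: someDirectoryOfMine, absolute: true });

// ...and the same thing with fast-glob
const withFastGlob = fg.sync('**/*.md', { cwd: someDirectoryOfMine, absolute: true });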
I have a directory with about 450 files, of which 320 are .md files. Let's compare:
walk: 10.212ms
glob: 37.492ms
fg: 14.200ms
I measured it using console.time like this:
console.time('walk');
const foundFiles = walk(someDirectoryOfMine);
console.timeEnd('walk');
console.log(foundFiles.length);
I suppose those packages have other, fancier features, but I guess this just goes to show: keep it simple.
23 August 2019 0 comments Python
TextBlob is a wonderful Python library. It wraps nltk with a really pleasant API. Out of the box, you get a spell-corrector. From the tutorial:
>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> str(b.correct())
'I have good spelling!'
The way it works is that, shipped with the library, is this text file: en-spelling.txt. It's about 30,000 lines long and looks like this:
;;; Based on several public domain books from Project Gutenberg
;;; and frequency lists from Wiktionary and the British National Corpus.
;;; http://norvig.com/big.txt
;;;
a 21155
aah 1
aaron 5
ab 2
aback 3
abacus 1
abandon 32
abandoned 72
abandoning 27
That gave me an idea! How about I use the TextBlob API but bring my own text as the training model? It doesn't have to be all that complicated.
The challenge
(Note: All the code I used for this demo is available here: github.com/peterbe/spellthese)
I found this site that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple and mess with their spelling: remove letters, add letters, and swap letters. (A sketch of such a typo generator is shown after the sample output below.)
So, 5 random names now look like this:
▶ python challenge.py
RIGHT: jameson TYPOED: jamesone
RIGHT: abel TYPOED: aabel
RIGHT: wesley TYPOED: welsey
RIGHT: thomas TYPOED: thhomas
RIGHT: bryson TYPOED: brysn
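By the way, here's a minimal sketch of what such a typo generator might look like (the function name mess_up and its details are my own; the actual code is in the repo linked above):

import random
import string

def mess_up(name):
    """Randomly remove, add, or swap letters in a name."""
    i = random.randrange(len(name))
    operation = random.choice(["remove", "add", "swap"])
    if operation == "remove" and len(name) > 1:
        return name[:i] + name[i + 1:]
    if operation == "add":
        return name[:i] + random.choice(string.ascii_lowercase) + name[i:]
    if operation == "swap" and i < len(name) - 1:
        return name[:i] + name[i + 1] + name[i] + name[i + 2:]
    return name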
Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.
First, let's use the built-in TextBlob.correct. A bit simplified, but it looks like this:
from textblob import TextBlob
correct, typo = get_random_name()
b = TextBlob(typo)
result = str(b.correct())
right = correct == result
...
And the results:
▶ python test.py
ORIGIN TYPO RESULT WORKED?
jesus jess less Fail
austin ausin austin Yes!
julian juluian julian Yes!
carter crarter charter Fail
emmett emett met Fail
daniel daiel daniel Yes!
luca lua la Fail
anthony anthonyh anthony Yes!
damian daiman cabman Fail
kevin keevin keeping Fail
Right 40.0% of the time
Buuh! Not very impressive. So what went wrong there? Well, the word met is much more common than emmett, and the same goes for words like less, charter, keeping, etc. You know, because English.
The solution
The solution is actually really simple. You just crack open the classes out of textblob like this:
from textblob import TextBlob
from textblob.en import Spelling
path = "spelling-model.txt"
spelling = Spelling(path=path)
# Here, 'names' is a list of all the 1,000 correctly spelled names.
# e.g. ['Liam', 'Noah', 'William', 'James', ...
spelling.train(" ".join(names), path)
Now, instead of corrected = str(TextBlob(typo).correct()), we do result = spelling.suggest(typo)[0][0], as demonstrated here:
correct, typo = get_random_name()
b = spelling.suggest(typo)
result = b[0][0]
right = correct == result
...
So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:
▶ python test.py
UNTRAINED...
ORIGIN TYPO RESULT WORKED?
juan jaun juan Yes!
ethan etha the Fail
bryson brysn bryan Fail
hudson hudsn hudson Yes!
oliver roliver oliver Yes!
ryan rnyan ran Fail
cameron caeron carron Fail
christopher hristopher christopher Yes!
elias leias elias Yes!
xavier xvaier xvaier Fail
justin justi just Fail
leo lo lo Fail
adrian adian adrian Yes!
jonah ojnah noah Fail
calvin cavlin calvin Yes!
jose joe joe Fail
carter arter after Fail
braxton brxton brixton Fail
owen wen wen Fail
thomas thoms thomas Yes!
Right 40.0% of the time
TRAINED...
ORIGIN TYPO RESULT WORKED?
landon landlon landon Yes
sebastian sebstian sebastian Yes
evan ean ian Fail
isaac isaca isaac Yes
matthew matthtew matthew Yes
waylon ywaylon waylon Yes
sebastian sebastina sebastian Yes
adrian darian damian Fail
david dvaid david Yes
calvin calivn calvin Yes
jose ojse jose Yes
carlos arlos carlos Yes
wyatt wyatta wyatt Yes
joshua jsohua joshua Yes
anthony antohny anthony Yes
christian chrisian christian Yes
tristan tristain tristan Yes
theodore therodore theodore Yes
christopher christophr christopher Yes
joshua oshua joshua Yes
Right 90.0% of the time
See, with very little effort you can go from 40% correct to 90% correct.
Note that the output of something like spelling.suggest('darian') is actually a list, like this: [('damian', 0.5), ('adrian', 0.5)], and you can use that in your application. For example:
<li><a href="?name=damian">Did you mean <b>damian</b></a></li>
<li><a href="?name=adrian">Did you mean <b>adrian</b></a></li>
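A sketch of how you might generate that markup from the suggestions (the HTML structure above is just illustrative):

suggestions = spelling.suggest("darian")  # e.g. [('damian', 0.5), ('adrian', 0.5)]
for name, score in suggestions:
    print(f'<li><a href="?name={name}">Did you mean <b>{name}</b></a></li>')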
Bonus and conclusion
Ultimately, what TextBlob does is a re-implementation of Peter Norvig's original implementation from 2007. I, too, wrote my own implementation in 2007. Depending on your needs, you can figure out the licensing of that source code, lift it out, and adapt it to your custom needs. But TextBlob wraps it up nicely for you.
When you use the textblob.en.Spelling class you have some choices. First, like I did in my demo:
path = "spelling-model.txt"
spelling = Spelling(path=path)
spelling.train(my_space_separated_text_blob, path)
What that does is create a file spelling-model.txt that wasn't there before. It looks like this (in my demo):
▶ head spelling-model.txt
aaron 1
abel 1
adam 1
adrian 1
aiden 1
alexander 1
andrew 1
angel 1
anthony 1
asher 1
The number (on the right) there is the "frequency" of the word. But what if you have a "scoring" number of your own? Perhaps, in your application, you just know that adrian is more right than damian. Then, you can make your own file:
Suppose the text file ("spelling-model-weighted.txt") contains lines like this:
...
adrian 8
damian 3
...
Now, the output becomes:
>>> import os
>>> from textblob.en import Spelling
>>> path = "spelling-model-weighted.txt"
>>> assert os.path.isfile(path)
>>> spelling = Spelling(path=path)
>>> spelling.suggest('darian')
[('adrian', 0.7272727272727273), ('damian', 0.2727272727272727)]
Based on the weighting, the numbers work out. I.e. 3 / (3 + 8) == 0.2727272727272727 and 8 / (3 + 8) == 0.7272727272727273.
I hope it inspires you to write your own spelling application using TextBlob.
For example, you can feed it the names of your products on an e-commerce site. The .txt file might bloat if you have too many entries, but note that the 30,000-line en-spelling.txt is only 314KB and it loads in...:
>>> from textblob import TextBlob
>>> from time import perf_counter
>>> b = TextBlob("I havv goood speling!")
>>> t0 = perf_counter(); right = b.correct() ; t1 = perf_counter()
>>> t1 - t0
0.07055813199999861
That's roughly 70ms for 30,000 words.
15 August 2019 0 comments Javascript
I'm working on a CLI in Node. What the CLI does is take one set of .json files, compute some stuff, and spit out a different set of .json files. But what it does is not important. I wanted the CLI to feel flexible and powerful but also quite forgiving. And if you typo something, it should bubble up an error rather than swallowing it with something like console.error("not a valid file!").
Basically, you use it like this:
node index.js /some/directory
# or
node index.js /some/directory /some/other/directory
# or
node index.js /some/directory/specificfile.json
# or
node index.js /some/directory/specificfile.json /some/directory/otherfile.json
# or
node index.js "/some/directory/*.json"
# or
node index.js "/some/directory/**/*.json"
(Note that when typing patterns in the shell you have to quote them, otherwise the shell will do the expansion for you.)
Or, any combination of all of these:
node index.js "/some/directory/**/*.json" /other/directory /some/specific/file.json
Whatever you use, with patterns, in particular, it has to make the final list of found files distinct and ordered by the order of the initial arguments.
Here's what I came up with:
import fs from "fs";
import path from "path";

// https://www.npmjs.com/package/glob
import glob from "glob";

/** Given an array of "things" return all distinct .json files.
 *
 * Note that these "things" can be a directory, a file path, or a
 * pattern.
 * Only if each thing is a directory do we search for *.json files
 * in there recursively.
 */
function expandFiles(directoriesPatternsOrFiles) {
  function findFiles(directory) {
    const found = glob.sync(path.join(directory, "*.json"));
    fs.readdirSync(directory, { withFileTypes: true })
      .filter(dirent => dirent.isDirectory())
      .map(dirent => path.join(directory, dirent.name))
      .map(findFiles)
      .forEach(files => found.push(...files));
    return found;
  }

  const filePaths = [];
  directoriesPatternsOrFiles.forEach(thing => {
    let files = [];
    if (thing.includes("*")) {
      // It's a pattern!
      files = glob.sync(thing);
    } else {
      const lstat = fs.lstatSync(thing);
      if (lstat.isDirectory()) {
        files = findFiles(thing);
      } else if (lstat.isFile()) {
        files = [thing];
      } else {
        throw new Error(`${thing} is neither file nor directory`);
      }
    }
    files.forEach(p => filePaths.includes(p) || filePaths.push(p));
  });
  return filePaths;
}
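And a sketch of how it might be wired up in the CLI entry point (the exact argument handling in my CLI isn't shown here):

// Expand every CLI argument into actual .json file paths
const filePaths = expandFiles(process.argv.slice(2));
filePaths.forEach(filePath => {
  console.log("Would process:", filePath);
});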
This is where I'm bracing myself for comments that either point out something obvious that Node experts know or some awesome npm package that already does this but better.
If you have a typo, you get an error thrown that looks something like this:
Error: ENOENT: no such file or directory, lstat 'mydirectorrry'
(assuming mydirectory exists but mydirectorrry is a typo)
24 July 2019 0 comments Javascript, Web Performance, ReactJS, Web development
tl;dr; The previous (React) total JavaScript bundle size was: 36.2K Brotli compressed. The new (Preact) JavaScript bundle size was: 5.9K. I.e. 6 times smaller. Also, it appears to load faster in WebPageTest.
I have this page that is a Django server-side rendered page that has on it a form that looks something like this:
<div id="root">
<form action="https://songsear.ch/q/">
<input type="search" name="term" placeholder="Type your search here..." />
<button>Search</button>
</form>
</div>
It's a simple search form. But, to make it a bit better for users, I wrote a React widget that renders, into this document.querySelector('#root'), a near-identical <form> but with autocomplete functionality that displays suggestions as you type.
Anyway, I built that React bundle using create-react-app. I use the yarn run build command that generates...
css/main.83463791.chunk.css - 1.4K
js/main.ec6364ab.chunk.js - 9.0K (gzip 2.8K, br 2.5K)
js/runtime~main.a8a9905a.js - 1.5K (gzip 754B, br 688B)
js/2.b944397d.chunk.js - 119K (gzip 36K, br 33K)
Then, in Python, a piece of post-processing code copies the files from the build/static/ directory and inserts them into the rendered HTML file. The CSS gets injected as an inline <style> tag.
It's a simple little widget. No need for any service workers, react-router, or any global state stuff. (Actually, it only has one single runtime dependency outside the framework.) I thought, how about moving this to Preact?
In comes preact-cli
The app used a couple of React hooks but they were easy to transform into class components. Now I just needed to run:
npx preact create --yarn widget name-of-my-preact-project
cd name-of-my-preact-project
mkdir src
cp ../name-of-React-project/src/App.js src/
code src/App.js
Then, I slowly moved over the src/App.js from the create-react-app project and, little by little, did the various little things that you need to do. For example, learning to build with preact build --no-prerender --no-service-worker and how to override the default template.
Long story short, the new built bundles look like this:
style.82edf.css - 1.4K
bundle.d91f9.js - 18K (gzip 6.4K, br 5.9K)
polyfills.9168d.js - 4.5K (gzip 1.8K, br 1.6K)
(The polyfills.9168d.js gets injected as a script tag if window.fetch is falsy)
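As a sketch, that conditional injection might look something like this (the exact mechanism in my build is not shown here; this is just to illustrate the idea):

<script>
  // Only load the polyfill bundle on browsers without window.fetch;
  // document.write keeps it executing before the main bundle
  if (!window.fetch) {
    document.write('<script src="/polyfills.9168d.js"><\/script>');
  }
</script>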
Unfortunately, when I did the move from React to Preact I also made some small fixes. Doing the "migration" I noticed a block of code that was never used, so that gives the Preact build bundle a slight advantage. But I think it's nominal.
In conclusion: The previous total JavaScript bundle size was: 36.2K (Brotli compressed). The new JavaScript bundle size was: 5.9K (Brotli compressed). I.e. 6 times smaller. But if you worry about the total amount of JavaScript to parse and execute, the size difference uncompressed was 129K vs. 18K. I.e. 7 times smaller. I can only speculate, but I do suspect you need less CPU/battery to process 18K instead of 129K, if CPU/battery matters more than (or close to as much as) network I/O.

Rendering speed difference
Rendering speed is so darn hard to measure on the web because the app is so small. Plus, there's so much else going on that matters.
However, using WebPageTest I can do a visual comparison with the "Mobile - Slow 3G" preset. It'll be a somewhat decent measurement of the total time of downloading, parsing, and executing. Thing is, the server-side rendered HTML form has a button. But the React/Preact widget that takes over the DOM hides that submit button. So, using the screenshots that WebPageTest provides, I can deduce that the Preact widget completes 0.8 seconds faster than the React widget. (I.e. instead of 4.4s it became 3.9s.)
Truth be told, I'm not sure how predictable or reproducible this is. I ran that WebPageTest visual comparison more than once and the results can vary significantly. I'm not even sure which run I'm referring to here (in the screenshot), but the React widget version was never faster.
Conclusion and thoughts
Unsurprisingly, Preact is smaller because you simply get less from that framework, e.g. synthetic events. I was lucky: my app uses onChange, which I could easily "migrate" to onInput, and I managed to get it to work pretty easily. I'm glad the widget app was so small and that I don't depend on any React-specific third-party dependencies.
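As a sketch of what that migration amounts to (the component and state names here are made up):

import { h, Component } from "preact";

// In Preact, the native onInput event replaces React's synthetic onChange
class SearchInput extends Component {
  state = { term: "" };

  render() {
    return (
      <input
        type="search"
        value={this.state.term}
        onInput={event => this.setState({ term: event.target.value })}
      />
    );
  }
}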
But! The WebPageTest visual comparison was on "Mobile - Slow 3G", which only represents a small portion of the traffic. Mobile is a huge portion of the traffic, but "Slow 3G" is not. When you do a Desktop comparison, the difference is roughly 0.1s.
Also, in total, that page is made up of 3 major elements:
- The server-side rendered HTML
- The progressive JavaScript widget (what this blog post is about)
- A banner ad initiated by a piece of JavaScript
That HTML controls the "First Meaningful Paint" which takes 3 seconds. And the whole shebang, including the banner ad, takes a total of about 9s. So, all this work of rewriting a React app to Preact saved me 0.8s out of the total of 9s.
Web performance is hard and complicated. Every little bit counts, but keep your eye on the big-ticket items, assuming there's something you can do about them.
At the time of writing, preact-cli uses Preact 8.2 and I'm eager to see how Preact X feels. Apparently, since April 2019, it's in beta. Looking forward to giving it a try!
13 July 2019 0 comments Javascript, Web development
I use localhost:3000 for a lot of different projects. It's the default port on create-react-app's dev server. The browser profile remains but projects come and go. There's a lot of old stuff in there that I no longer have any memory of adding.

Working in a recent single page app, I tried to use localStorage as a cache for some XHR requests and got: DOMException: "The quota has been exceeded.".
Wat?! I'm only trying to store a ~250KB JSON string. Surely that's far away from the mythical 5MB limit. Do I really have to LZW-compress the string in and out to save room and pay for it in CPU cycles?
Better yet, find out what junk I still have in there.
Paste this into your Web Console (it's safe as milk):
Object.entries(localStorage).forEach(([k, v]) => console.log(k, v.length, (v.length / 1024).toFixed(1) + 'KB'))
The output looks something like this:

Or, sorted and filtered a bit:
Object.entries(localStorage)
  .sort((a, b) => b[1].length - a[1].length)
  .slice(0, 5)
  .forEach(([k, v]) => console.log(k, v.length, (v.length / 1024).toFixed(1) + 'KB'));
Looks like this:

And for the record, summed total in kilobytes:
(Object.values(localStorage).map(x => x.length).reduce((a, b) => a + b) / 1024).toFixed(1) + 'KB';

Wrapping up
Seems my Firefox browser's localStorage limit is still 5MB.
Also, you can do the loop using localStorage.length, localStorage.key(n), and localStorage.getItem(localStorage.key(n)).length, but using Object.entries(localStorage) seems neater.
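For completeness, a sketch of that classic-API loop:

// Equivalent loop using the classic localStorage API
for (let i = 0; i < localStorage.length; i++) {
  const key = localStorage.key(i);
  const value = localStorage.getItem(key);
  console.log(key, value.length, (value.length / 1024).toFixed(1) + 'KB');
}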
I guess this means I can still use localStorage in my app. It seems I just need to localStorage.removeItem('massive-list:items') which sounds like an experiment, from eons ago, for seeing how much I can stuff in there.
11 July 2019 0 comments Redis, Nginx, Python, Django
By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....
Roughly, 1.8 requests per second go back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".

Caching strategy
Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.
What I used to do was something like this:
def search_autocomplete(request):
    q = request.GET.get('q')
    cache_key = None
    if len(q) < 10:
        cache_key = 'autocomplete:' + q
        results = cache.get(cache_key)
        if results is not None:
            return http.JsonResponse(results)
    results = _do_elastisearch_query(q)
    if cache_key:
        cache.set(cache_key, results, 60 * 60)
    return http.JsonResponse(results)
However, after some simple benchmarking, it was clear that with Nginx's uwsgi_cache it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:
def search_autocomplete(request):
    q = request.GET.get('q')
    results = _do_elastisearch_query(q)
    response = http.JsonResponse(results)
    if len(q) < 10:
        patch_cache_control(response, public=True, max_age=60 * 60)
    return response
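On the Nginx side, the corresponding configuration might look roughly like this (a sketch; the zone name, paths, and socket are assumptions, not my actual config):

uwsgi_cache_path /var/cache/nginx-cache levels=1:2 keys_zone=autocomplete:10m;

server {
    location /api/search/autocomplete {
        include uwsgi_params;
        uwsgi_pass unix:/run/uwsgi/app.sock;
        uwsgi_cache autocomplete;
        uwsgi_cache_key $request_uri;
        # Cacheability and TTL come from the Cache-Control header
        # that the Django view sets via patch_cache_control()
    }
}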
The only annoying thing about Nginx caching is that purging is hard unless you go for Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is the fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...
> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...
Nginx Amplify
My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:



Thoughts and conclusion
Caching is hard. But it's also fun because it ties directly into performance work.
In my business logic, I chose that autocomplete queries that are between 1 and 9 characters are cacheable. And I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes in RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB which isn't much.
Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python, I was able to decide on the cache key myself. A plausible implementation is cache_key = q.lower().strip(), for example. That means you can protect your Elasticsearch backend from having to do both {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.
The ideal would be some user-friendly diagnostics tool that I can point somewhere, towards Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.
UPDATE (July 12 2019)
Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps them up) for queries that are at most 9 characters long. I wanted to understand what ratio of all queries is longer than 9 characters, so I wrote a report, and its output is this:
POINT: 7
Sum show 75646 32.2%
Sum rest 159321 67.8%
POINT: 8
Sum show 83702 35.6%
Sum rest 151265 64.4%
POINT: 9
Sum show 90870 38.7%
Sum rest 144097 61.3%
POINT: 10
Sum show 98384 41.9%
Sum rest 136583 58.1%
POINT: 11
Sum show 106093 45.2%
Sum rest 128874 54.8%
POINT: 12
Sum show 113905 48.5%
Sum rest 121062 51.5%
It means that (independent of time expiry) 38.7% of queries are 9 characters or less.