Peterbe.com

A blog and website by Peter Bengtsson

Train your own spell corrector with TextBlob

23 August 2019 0 comments   Python

https://github.com/peterbe/spellthese


TextBlob is a wonderful Python library it. It wraps nltk with a really pleasant API. Out of the box, you get a spell-corrector. From the tutorial:

>>> from textblob import TextBlob
>>> b = TextBlob("I havv goood speling!")
>>> str(b.correct())
'I have good spelling!'

The way it works is that, shipped with the library, is this text file: en-spelling.txt It's about 30,000 lines long and looks like this:

;;;   Based on several public domain books from Project Gutenberg
;;;   and frequency lists from Wiktionary and the British National Corpus.
;;;   http://norvig.com/big.txt
;;;   
a 21155
aah 1
aaron 5
ab 2
aback 3
abacus 1
abandon 32
abandoned 72
abandoning 27

That gave me an idea! How about I use the TextBlob API but bring my own text as the training model. It doesn't have to be all that complicated.

The challenge

(Note: All the code I used for this demo is available here: github.com/peterbe/spellthese)

I found this site that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple of out and mess with their spelling. Like, remove letters, add letters, and swap letters.

So, 5 random names now look like this:

▶ python challenge.py
RIGHT: jameson  TYPOED: jamesone
RIGHT: abel     TYPOED: aabel
RIGHT: wesley   TYPOED: welsey
RIGHT: thomas   TYPOED: thhomas
RIGHT: bryson   TYPOED: brysn

Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.

First, let's use the built in TextBlob.correct. A bit simplified but it looks like this:

from textblob import TextBlob


correct, typo = get_random_name()
b = TextBlob(typo)
result = str(b.correct())
right = correct == result
...

And the results:

▶ python test.py
ORIGIN         TYPO           RESULT         WORKED?
jesus          jess           less           Fail
austin         ausin          austin         Yes!
julian         juluian        julian         Yes!
carter         crarter        charter        Fail
emmett         emett          met            Fail
daniel         daiel          daniel         Yes!
luca           lua            la             Fail
anthony        anthonyh       anthony        Yes!
damian         daiman         cabman         Fail
kevin          keevin         keeping        Fail
Right 40.0% of the time

Buuh! Not very impressive. So what went wrong there? Well, the word met is much more common than emmett and the same goes for words like less, charter, keeping etc. You know, because English.

The solution

The solution is actually really simple. You just crack open the classes out of textblob like this:

from textblob import TextBlob
from textblob.en import Spelling

path = "spelling-model.txt"
spelling = Spelling(path=path)
# Here, 'names' is a list of all the 1,000 correctly spelled names.
# e.g. ['Liam', 'Noah', 'William', 'James', ...
spelling.train(" ".join(names), path)

Now, instead of corrected = str(TextBlob(typo).correct()) we do result = spelling.suggest(typo)[0][0] as demonstrated here:

correct, typo = get_random_name()
b = spelling.suggest(typo)
result = b[0][0]
right = correct == result
...

So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:

▶ python test.py
UNTRAINED...
ORIGIN         TYPO           RESULT         WORKED?
juan           jaun           juan           Yes!
ethan          etha           the            Fail
bryson         brysn          bryan          Fail
hudson         hudsn          hudson         Yes!
oliver         roliver        oliver         Yes!
ryan           rnyan          ran            Fail
cameron        caeron         carron         Fail
christopher    hristopher     christopher    Yes!
elias          leias          elias          Yes!
xavier         xvaier         xvaier         Fail
justin         justi          just           Fail
leo            lo             lo             Fail
adrian         adian          adrian         Yes!
jonah          ojnah          noah           Fail
calvin         cavlin         calvin         Yes!
jose           joe            joe            Fail
carter         arter          after          Fail
braxton        brxton         brixton        Fail
owen           wen            wen            Fail
thomas         thoms          thomas         Yes!
Right 40.0% of the time

TRAINED...
ORIGIN         TYPO           RESULT         WORKED?
landon         landlon        landon         Yes
sebastian      sebstian       sebastian      Yes
evan           ean            ian            Fail
isaac          isaca          isaac          Yes
matthew        matthtew       matthew        Yes
waylon         ywaylon        waylon         Yes
sebastian      sebastina      sebastian      Yes
adrian         darian         damian         Fail
david          dvaid          david          Yes
calvin         calivn         calvin         Yes
jose           ojse           jose           Yes
carlos         arlos          carlos         Yes
wyatt          wyatta         wyatt          Yes
joshua         jsohua         joshua         Yes
anthony        antohny        anthony        Yes
christian      chrisian       christian      Yes
tristan        tristain       tristan        Yes
theodore       therodore      theodore       Yes
christopher    christophr     christopher    Yes
joshua         oshua          joshua         Yes
Right 90.0% of the time

See, with very little effort you can got from 40% correct to 90% correct.

Note, that the output of something like spelling.suggest('darian') is actually a list like this: [('damian', 0.5), ('adrian', 0.5)] and you can use that in your application. For example:

<li><a href="?name=damian">Did you mean <b>damian</b></a></li>
<li><a href="?name=adrian">Did you mean <b>adrian</b></a></li>

Bonus and conclusion

Ultimately, what TextBlob does is a re-implementation of Peter Norvig's original implementation from 2007. I too, have written my own implementation in 2007. Depending on your needs, you can just figure out the licensing of that source code and lift it out and implement in your custom ways. But TextBlob wraps it up nicely for you.

When you use the textblob.en.Spelling class you have some choices. First, like I did in my demo:

path = "spelling-model.txt"
spelling = Spelling(path=path)
spelling.train(my_space_separated_text_blob, path)

What that does is creating a file spelling-model.txt that wasn't there before. It looks like this (in my demo):

▶ head spelling-model.txt
aaron 1
abel 1
adam 1
adrian 1
aiden 1
alexander 1
andrew 1
angel 1
anthony 1
asher 1

The number (on the right) there is the "frequency" of the word. But what if you have a "scoring" number of your own. Perhaps, in your application you just know that adrian is more right than damian. Then, you can make your own file:

Suppose the text file ("spelling-model-weighted.txt") contains lines like this:

...
adrian 8
damian 3
...

Now, the output becomes:

>>> import os
>>> from textblob.en import Spelling
>>> import os
>>> path = "spelling-model-weighted.txt"
>>> assert os.path.isfile(path)
>>> spelling = Spelling(path=path)
>>> spelling.suggest('darian')
[('adrian', 0.7272727272727273), ('damian', 0.2727272727272727)]

Based on the weighting, these numbers add up. I.e. 3 / (3 + 8) == 0.2727272727272727

I hope it inspires you to write your own spelling application using TextBlob.

For example, you can feed it the names of your products on an e-commerce site. The .txt file might bloat if you have too much but note that the 30K lines en-spelling.txt is only 314KB and it loads in...:

>>> from textblob import TextBlob
>>> from time import perf_counter
>>> b = TextBlob("I havv goood speling!")
>>> t0 = perf_counter(); right = b.correct() ; t1 = perf_counter()
>>> t1 - t0
0.07055813199999861

...70ms for 30,000 words.

function expandFiles(directoriesPatternsOrFiles)

15 August 2019 0 comments   Javascript

https://gist.github.com/peterbe/48e4a12d60c339cc6acbfe2bb6c7bbeb


I'm working on a CLI in Node. What the CLI does it that it takes one set of .json files, compute some stuff, and spits out a different set of .json files. But what it does is not important. I wanted the CLI to feel flexible and powerful but also quite forgiving. And if you typo something, it should bubble up an error rather than redirecting it to something like console.error("not a valid file!").

Basically, you use it like this:

node index.js /some/directory
# or
node index.js /some/directory /some/other/directory
# or 
node index.js /some/directory/specificfile.json
# or
node index.js /some/directory/specificfile.json /some/directory/otherfile.json
# or
node index.js "/some/directory/*.json"
# or 
node index.js "/some/directory/**/*.json"

(Note that when typing patterns in the shell you have quote them, otherwise the shell will do the expansion for you)

Or, any combination of all of these:

node index.js "/some/directory/**/*.json" /other/directory /some/specific/file.json 

Whatever you use, with patterns, in particular, it has to make the final list of found files distinct and ordered by the order of the initial arguments.

Here's what I came up with:

import fs from "fs";
import path from "path";
// https://www.npmjs.com/package/glob
import glob from "glob";


/** Given an array of "things" return all distinct .json files.
 *
 * Note that these "things" can be a directory, a file path, or a
 * pattern.
 * Only if each thing is a directory do we search for *.json files
 * in there recursively.
 */
function expandFiles(directoriesPatternsOrFiles) {
  function findFiles(directory) {
    const found = glob.sync(path.join(directory, "*.json"));

    fs.readdirSync(directory, { withFileTypes: true })
      .filter(dirent => dirent.isDirectory())
      .map(dirent => path.join(directory, dirent.name))
      .map(findFiles)
      .forEach(files => found.push(...files));

    return found;
  }

  const filePaths = [];
  directoriesPatternsOrFiles.forEach(thing => {
    let files = [];
    if (thing.includes("*")) {
      // It's a pattern!
      files = glob.sync(thing);
    } else {
      const lstat = fs.lstatSync(thing);
      if (lstat.isDirectory()) {
        files = findFiles(thing);
      } else if (lstat.isFile()) {
        files = [thing];
      } else {
        throw new Error(`${thing} is neither file nor directory`);
      }
    }
    files.forEach(p => filePaths.includes(p) || filePaths.push(p));
  });
  return filePaths;
}

This is where I'm bracing myself for comments that either point out something obvious that Node experts know or some awesome npm package that already does this but better.

If you have a typo, you get an error thrown that looks something like this:

Error: ENOENT: no such file or directory, lstat 'mydirectorrry'

(assuming mydirectory exists but mydirectorrry is a typo)

A React vs. Preact case study for a widget

24 July 2019 0 comments   Javascript, Web Performance, ReactJS, Web development


tl;dr; The previous (React) total JavaScript bundle size was: 36.2K Brotli compressed. The new (Preact) JavaScript bundle size was: 5.9K. I.e. 6 times smaller. Also, it appears to load faster in WebPageTest.

I have this page that is a Django server-side rendered page that has on it a form that looks something like this:

<div id="root">  
  <form action="https://songsear.ch/q/">  
    <input type="search" name="term" placeholder="Type your search here..." />
    <button>Search</button>
  </form>  
</div>

It's a simple search form. But, to make it a bit better for users, I wrote a React widget that renders, into this document.querySelector('#root'), a near-identical <form> but with autocomplete functionality that displays suggestions as you type.

Anyway, I built that React bundle using create-react-app. I use the yarn run build command that generates...

Then, in Python, a piece of post-processing code copies the files from the build/static/ directory and inserts it into the rendered HTML file. The CSS gets injected as an inline <style> tag.

It's a simple little widget. No need for any service-workers or react-router or any global state stuff. (Actually, it only has 1 single runtime dependency outside the framework) I thought, how about moving this to Preact?

In comes preact-cli

The app used a couple of React hooks but they were easy to transform into class components. Now I just needed to run:

npx preact create --yarn widget name-of-my-preact-project
cd name-of-my-preact-project
mkdir src
cp ../name-of-React-project/src/App.js src/
code src/App.js

Then, I slowly moved over the src/App.js from the create-react-app project and slowly by slowly I did the various little things that you need to do. For example, to learn to build with preact build --no-prerender --no-service-worker and how I can override the default template.

Long story short, the new built bundles look like this:

(The polyfills.9168d.js gets injected as a script tag if window.fetch is falsy)

Unfortunately, when I did the move from React to Preact I did make some small fixes. Doing the "migration" I noticed a block of code that was never used so that gives the build bundle from Preact a slight advantage. But I think it's nominal.

In conclusion: The previous total JavaScript bundle size was: 36.2K (Brotli compressed). The new JavaScript bundle size was: 5.9K (Brotli compressed). I.e. 6 times smaller. But if you worry about the total amount of JavaScript to parse and execute, the size difference uncompressed was 129K vs. 18K. I.e. 7 times smaller. I can only speculate but I do suspect you need less CPU/battery to process 18K instead of 129K if CPU/batter matters more (or closer to) than network I/O.

WebPageTest - Visual Comparison - Mobile Slow 3G

Rendering speed difference

Rendering speed is so darn hard to measure on the web because the app is so small. Plus, there's so much else going on that matters.

However, using WebPageTest I can do a visual comparison with the "Mobile - Slow 3G" preset. It'll be a somewhat decent measurement of the total time of downloading, parsing and executing. Thing is, the server-side rended HTML form has a button. But the React/Preact widget that takes over the DOM hides that submit button. So, using the screenshots that WebPageTest provides, I can deduce that the Preact widget completes 0.8 seconds faster than the React widget. (I.e. instead of 4.4s it became 3.9s)

Truth be told, I'm not sure how predictable or reproducible is. I ran that WebPageTest visual comparison more than once and the results can vary significantly. I'm not even sure which run I'm referring to here (in the screenshot) but the React widget version was never faster.

Conclusion and thoughts

Unsurprisingly, Preact is smaller because you simply get less from that framework. E.g. synthetic events. I was lucky. My app uses onChange which I could easily "migrate" to onInput and I managed to get it to work pretty easily. I'm glad the widget app was so small and that I don't depend on any React specific third-party dependencies.

But! In WebPageTest Visual Comparison it was on "Mobile - Slow 3G" which only represents a small portion of the traffic. Mobile is a huge portion of the traffic but "Slow 3G" is not. When you do a Desktop comparison the difference is roughtly 0.1s.

Also, in total, that page is made up of 3 major elements

  1. The server-side rendered HTML
  2. The progressive JavaScript widget (what this blog post is about)
  3. A piece of JavaScript initiated banner ad

That HTML controls the "First Meaningful Paint" which takes 3 seconds. And the whole shebang, including the banner ad, takes a total of about 9s. So, all this work of rewriting a React app to Preact saved me 0.8s out of the total of 9s.

Web performance is hard and complicated. Every little counts, but keep your eye on the big ticket items assuming there's something you can do about them.

At the time of writing, preact-cli uses Preact 8.2 and I'm eager to see how Preact X feels. Apparently, since April 2019, it's in beta. Looking forward to giving it a try!

Find out all localStorage keys and their value sizes

13 July 2019 0 comments   Javascript, Web development


I use localhost:3000 for a lot of different projects. It's the default port on create-react-app's dev server. The browser profile remains but projects come and go. There's a lot of old stuff in there that I have no longer any memory of adding.

My Storage tab in Firefox

Working in a recent single page app, I tried to use localStorage as a cache for some XHR requests and got: DOMException: "The quota has been exceeded.".
Wat?! I'm only trying to store a ~250KB JSON string. Surely that's far away from the mythical 5MB limit. Do I really have to lzw compress the string in and out to save room and pay for it in CPU cycles?

Better yet, find out what junk I still have in there.

Paste this into your Web Console (it's safe as milk):

Object.entries(localStorage).forEach(([k,v]) => console.log(k, v.length, (v.length / 1024).toFixed(1) + 'KB'))

The output looks something like this:

Web Console output

Or, sorted and filtered a bit:

Object.entries(localStorage).sort((a, b) => b[1].length -a[1].length).slice(0,5).forEach(
([k,v]) => console.log(k, v.length, (v.length / 1024).toFixed(1) + 'KB'));

Looks like this:

Sorted and sliced

And for the record, summed total in kilobytes:

(Object.values(localStorage).map(x => x.length).reduce((a, b) => a + b) / 1024).toFixed(1) + 'KB';

Summed in KB

Wrapping up

Seems my Firefox browser's localStorage limit is still 5MB.

Also, you can do the loop using localStorage.length and localStorage.key(n) and localStorage.getItem(localStorage.key(n)).length but using Object.entries(localStorage) seems neater.

I guess this means I can still use localStorage in my app. It seems I just need to localStorage.removeItem('massive-list:items') which sounds like an experiment, from eons ago, for seeing how much I can stuff in there.

SongSearch autocomplete rate now 2+ per second

11 July 2019 0 comments   Redis, Nginx, Python, Django


By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....

Roughly, 1.8 requests per second goes back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".

Autocomplete counts in Datadog

Caching strategy

Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.

What I used to do was something like this:

def search_autocomplete(request):
   q = request.GET.get('q') 

   cache_key = None
   if len(q) < 10:
      cache_key = 'autocomplete:' + q
      results = cache.get(cache_key)
      if results is not None:
          return http.JsonResponse(results)

   results = _do_elastisearch_query(q)
   if cache_key:
       cache.set(cache_key, results, 60 * 60)

   return http.JsonResponse(results)   

However, after some simple benchmarking it was clear that using Nginx' uwsgi_cache it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:

def search_autocomplete(request):
   q = request.GET.get('q') 
   results = _do_elastisearch_query(q)
   response = http.JsonResponse(results)   

   if len(q) < 10:
       patch_cache_control(response, public=True, max_age=60 * 60)

   return response

The only annoying thing about Nginx caching is that purging is hard unless you go for that Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is that fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...

> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...

Nginx Amplify

My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:

NGINX Requests/s

NGINX Memory Usage

NGINX CPU Usage %

Thoughts and conclusion

Caching is hard. But it's also fun because it ties directly into performance work.

In my business logic, I chose that autocomplete queries that are between 1 and 9 characters are cacheable. And I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes in RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB which isn't much.

Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python I was able to decide about the cache key myself. A plausible implementation is cache_key = q.lower().strip() for example. That means you can protect your Elasticsearch backend from having to do {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.

The ideal would be some user-friendly diagnostics tool that I can point somewhere, towards Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.

UPDATE (July 12 2019)

Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps it up) for queries that at max 9 characters long. I wanted to appreciate and understand how ratio of all queries are longer than 9 characters so I wrote a report and its output is this:

POINT: 7
Sum show 75646 32.2%
Sum rest 159321 67.8%

POINT: 8
Sum show 83702 35.6%
Sum rest 151265 64.4%

POINT: 9
Sum show 90870 38.7%
Sum rest 144097 61.3%

POINT: 10
Sum show 98384 41.9%
Sum rest 136583 58.1%

POINT: 11
Sum show 106093 45.2%
Sum rest 128874 54.8%

POINT: 12
Sum show 113905 48.5%
Sum rest 121062 51.5%

It means that (independent of time expiry) 38.7% of queries are 9 characters or less.

From jQuery to Cash

18 June 2019 3 comments   Javascript, Web development


tl;dr; The main JavaScript bundle goes from 29KB to 6KB by switching from JQuery to Cash. Both with Brotli compression.

In Web Performance, every byte counts. Downloading less stuff means faster network operations but for JavaScript it also means less to parse and execute. This site used use JQuery 3.4.1 but now uses Cash 4.1.2. It requires some changes to how you use $ and most noticeable is the lack of animations and $.ajax.

I still stand by the $ function. It's great when you have a regular (static) website that isn't a single page app but still needs a little bit of interactive JavaScript functionality. On this site, I use it for making the commenting work and some various navigation/header stuff.

Switching to Cash means you have to stop doing things like $.getJSON() and $('.classname').fadeIn(400) which, in a sense, gives Cash an unfair advantage because those bits take up a large portion of the bundle size. Yes, there is a custom build of jQuery without those but check out this size comparison:

BundleUncompressed (bytes)Gzipped (bytes)
jQuery 3.4.188,14530,739
jQuery 3.4.1 Slim71,03724,403
Cash 4.1.214,8185,167

I still needed a fadeIn function, which I was relying on from jQuery, but to remedy that I just copied one of these from youmightnotneedjquery.com. It would be better to not do that an use a CSS transform instead but, well, I'm only human.

Before: with jQuery
Before: with jQuery

Another thing you'll need to replace is to switch from $.ajax to fetch but there are good polyfills but I haven't bothered with polyfills because the tiny percentage of visitors I have, without fetch support still get a working site but can't post comments.

I was contemplating doing what GitHub did in 2018 which was to replace jQuery with real vanilla JavaScript code but it didn't seem worth it now that Cash is only 5KB (gzipped) and it's an actively maintained project too.

Before: with jQuery
Before: with jQuery

After: with Cash
After: with Cash