## Converting Celsius to Fahrenheit round-up

### July 22, 2024 · 0 comments · Go, Node, Python, Bun, Ruby, Rust, JavaScript

In the last couple of days, I've created variations of a simple algorithm to demonstrate how Celsius and Fahrenheit seem to relate to each other if you "mirror the number".
It wasn't supposed to be about the programming language. Still, I used Python in the first one and I noticed that since the code is simple, it could be fun to write variants of it in other languages.

It was a fun exercise.

And speaking of fun, I couldn't help but throw in a benchmark using `hyperfine` that measures, essentially, how fast these CLIs can start up. The results look like this:

``````
Summary
  ./conversion-rs ran
    1.31 ± 1.30 times faster than ./conversion-go
    1.88 ± 1.33 times faster than ./conversion-cr
    7.15 ± 4.64 times faster than bun run conversion.ts
   14.27 ± 9.48 times faster than python3.12 conversion.py
   18.10 ± 12.35 times faster than node conversion.js
   67.75 ± 43.80 times faster than ruby conversion.rb
``````

It doesn't prove much that you didn't already expect. But it's fun to see how fast Python 3.12 has become at starting up.

Head on over to https://github.com/peterbe/temperature-conversion to play along. Perhaps you can see some easy optimizations (speed and style).

## Converting Celsius to Fahrenheit with Go

This is a continuation of Converting Celsius to Fahrenheit with Python and TypeScript, but in Go:

``````
package main

import (
	"fmt"
	"math"
)

// c2f converts whole degrees Celsius to Fahrenheit.
func c2f(c int) float64 {
	return float64(c)*9/5 + 32
}

// isMirror reports whether the two-digit forms of a and b are each other's reverse.
func isMirror(a int, b int) bool {
	return reverseString(massage(a)) == massage(b)
}

// massage turns n into a two-digit string: zero-padded below 10,
// and with the hundreds dropped from 100 and up.
func massage(n int) string {
	switch {
	case n < 10:
		return fmt.Sprintf("0%d", n)
	case n >= 100:
		return massage(n - 100)
	default:
		return fmt.Sprintf("%d", n)
	}
}

func reverseString(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func printConversion(c int, f int) {
	fmt.Printf("%d°C ~= %d°F\n", c, f)
}

func main() {
	// Start at 4°C and step by 12 until the mirror trick stops working.
	for c := 4; c < 100; c += 12 {
		f := c2f(c)
		if isMirror(c, int(math.Ceil(f))) {
			printConversion(c, int(math.Ceil(f)))
		} else if isMirror(c, int(math.Floor(f))) {
			printConversion(c, int(math.Floor(f)))
		} else {
			break
		}
	}
}
``````

Run it like this:

``````
go run conversion.go
``````

or build and run:

``````
go build -o conversion-go conversion.go
./conversion-go
``````

and the output becomes:

```
4°C ~= 40°F
16°C ~= 61°F
28°C ~= 82°F
40°C ~= 104°F
52°C ~= 125°F
```
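For comparison, here's a rough Python rendering of the same logic. It's a sketch that follows the Go code above function by function and is not necessarily identical to the `conversion.py` in the repository; the `mirrored_conversions` helper is a name made up for this example.

```python
import math


def c2f(c: int) -> float:
    return c * 9 / 5 + 32


def massage(n: int) -> str:
    """Two-digit form: zero-pad below 10, drop the hundreds from 100 and up."""
    if n < 10:
        return f"0{n}"
    if n >= 100:
        return massage(n - 100)
    return str(n)


def is_mirror(a: int, b: int) -> bool:
    return massage(a)[::-1] == massage(b)


def mirrored_conversions() -> list[str]:
    """Hypothetical helper: collect the lines instead of printing them."""
    lines = []
    for c in range(4, 100, 12):
        f = c2f(c)
        # Try the rounded-up value first, then the rounded-down one.
        for candidate in (math.ceil(f), math.floor(f)):
            if is_mirror(c, candidate):
                lines.append(f"{c}°C ~= {candidate}°F")
                break
        else:
            # Neither rounding mirrors the Celsius value, so stop.
            break
    return lines


if __name__ == "__main__":
    print("\n".join(mirrored_conversions()))
```

Run as a script, it prints the same five lines as the Go program.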

## Unzip benchmark on AWS EC2 c3.large vs c4.large

### November 29, 2017 · 18 comments · Python, Linux, Mozilla, Go

This web app I'm working on gets a blob of bytes from an HTTP POST. The blob is a zip file of between 100MB and 1,100MB. What my app currently does is take this byte buffer and use Python's built-in `zipfile` to extract all its content to a temporary directory. A second function then loops over the files within this extracted tree and processes each file in multiple threads with `concurrent.futures.ThreadPoolExecutor`. Here's the core function itself:

``````
import zipfile


def dump_and_extract(root_dir, file_buffer):
    zf = zipfile.ZipFile(file_buffer)
    zf.extractall(root_dir)
``````

So far so good.

### Speed Speed Speed

I quickly noticed that this is amounting to quite a lot of time spent doing the unzip and the writing to disk. What to do????

At first I thought I'd shell out to good old `unzip`. E.g. `unzip -d /tmp/tempdirextract /tmp/input.zip` but that has two flaws:

1) I'd first have to dump the blob of bytes to disk and do the overhead of shelling out (i.e. Python `subprocess`)
2) It's actually not faster. Did some experimenting and got the same results as Alex Martelli in this Stack Overflow post

What about disk speed? Yeah, this is likely to be a part of the total time. The servers that run the `symbols.mozilla.org` service run on AWS EC2 `c4.large`. This only has EBS (Elastic Block Store). However, AWS EC2 `c3.large` looks interesting since it's using SSD disks. That's probably a lot faster. Right?

Note! For context, the kind of .zip files I'm dealing with contain many small files and often 1-2 really large ones.

### EC2s Benchmarking

I created two EC2 nodes to experiment on. One `c3.large` and one `c4.large`. Both running Ubuntu 16.04.

Next, I have this little benchmarking script which loops over a directory full of .zip files, most of them between 200MB and 600MB. Roughly 10 of them. It then loads each one, one at a time, into memory and calls `dump_and_extract`. Let's run it on each EC2 instance:
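A benchmarking loop along those lines could look roughly like this. It's a minimal sketch and not the actual `fastest-dumper.py`: the `benchmark` function, the output format, and extracting into a fresh temporary directory each time are all assumptions made for illustration.

```python
import io
import statistics
import sys
import tempfile
import time
import zipfile
from pathlib import Path


def dump_and_extract(root_dir, file_buffer):
    zf = zipfile.ZipFile(file_buffer)
    zf.extractall(root_dir)


def benchmark(zips_dir):
    """Time dump_and_extract for each .zip in zips_dir; return speeds in MB/s."""
    speeds = []
    for path in sorted(Path(zips_dir).glob("*.zip")):
        # Load the whole file into memory first, like an HTTP POST body.
        payload = path.read_bytes()
        size_mb = len(payload) / 1024 / 1024
        with tempfile.TemporaryDirectory() as root_dir:
            t0 = time.perf_counter()
            dump_and_extract(root_dir, io.BytesIO(payload))
            elapsed = time.perf_counter() - t0
        speed = size_mb / elapsed
        speeds.append(speed)
        print(f"{speed:.1f}MB/s\t{size_mb:.1f}MB\t{elapsed:.3f}s")
    return speeds


if __name__ == "__main__" and len(sys.argv) == 2:
    speeds = benchmark(sys.argv[1])
    if speeds:
        print(f"Average speed: {statistics.mean(speeds):.1f}MB/s")
        print(f"Median speed:  {statistics.median(speeds):.1f}MB/s")
```

Only the call to `dump_and_extract` is inside the timed region, so reading the .zip from disk into memory doesn't count towards the reported speed.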

#### On `c4.large`

```
c4.large$ python3 fastest-dumper.py /tmp/massive-symbol-zips
138.2MB/s            291.1MB              2.107s
146.8MB/s            314.5MB              2.142s
144.8MB/s            288.2MB              1.990s
84.5MB/s             532.4MB              6.302s
146.6MB/s            314.2MB              2.144s
136.5MB/s            270.7MB              1.984s
85.9MB/s             518.9MB              6.041s
145.2MB/s            306.8MB              2.113s
127.8MB/s            138.7MB              1.085s
107.3MB/s            454.8MB              4.239s
141.6MB/s            251.2MB              1.774s

Average speed: 127.7MB/s
Median speed:  138.2MB/s

Average files created:       165
Average directories created: 129
```

#### On `c3.large`

```
c3.large$ python3 fastest-dumper.py -t /mnt/extracthere /tmp/massive-symbol-zips
105.4MB/s            290.9MB              2.761s
98.1MB/s             518.5MB              5.287s
108.1MB/s            251.2MB              2.324s
112.5MB/s            294.3MB              2.615s
113.7MB/s            314.5MB              2.767s
106.3MB/s            291.5MB              2.742s
104.8MB/s            291.1MB              2.778s
114.6MB/s            248.3MB              2.166s
114.2MB/s            248.2MB              2.173s
105.6MB/s            298.1MB              2.823s
106.2MB/s            297.6MB              2.801s
98.6MB/s             521.4MB              5.289s

Average speed: 107.3MB/s
Median speed:  106.3MB/s

Average files created:       165
Average directories created: 127
```

What the heck!? Going by the median speed, the SSD based instance is 23% slower!

I ran it a bunch of times and the average and median numbers are steady. `c4.large` is faster than `c3.large` at unzipping large blobs to disk. So much for that SSD!

### Something Weird Is Going On

It's highly likely that the unzipping work is CPU bound and that most of those, for example, 5 seconds are spent unzipping and only a small margin is the time it takes to write to disk.

If the unzipping CPU work is the dominant "time consumer" why is there a difference at all?!

Or, is the "compute power" the difference between c3 and c4 and disk writes immaterial?

For the record, this test clearly demonstrates that the locally mounted SSD drive is roughly 6 times faster than EBS.

```
c3.large$ dd if=/dev/zero of=/tmp/1gbtest bs=16k count=65536
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 16.093 s, 66.7 MB/s
c3.large$ sudo dd if=/dev/zero of=/mnt/1gbtest bs=16k count=65536
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.62728 s, 409 MB/s
```

Let's try again. But instead of using `c4.large` and `c3.large`, let's use the beefier `c4.4xlarge` and `c3.4xlarge`. Both have 16 vCPUs.

#### `c4.4xlarge`

```
c4.4xlarge$ python3 fastest-dumper.py /tmp/massive-symbol-zips
130.6MB/s            553.6MB              4.238s
149.2MB/s            297.0MB              1.991s
129.1MB/s            529.8MB              4.103s
116.8MB/s            407.1MB              3.486s
147.3MB/s            306.1MB              2.077s
151.9MB/s            248.2MB              1.634s
140.8MB/s            292.3MB              2.076s
146.8MB/s            288.0MB              1.961s
142.2MB/s            321.0MB              2.257s

Average speed: 139.4MB/s
Median speed:  142.2MB/s

Average files created:       148
Average directories created: 117
```

#### `c3.4xlarge`

```
c3.4xlarge$ python3 fastest-dumper.py -t /mnt/extracthere /tmp/massive-symbol-zips
95.1MB/s             502.4MB              5.285s
104.1MB/s            303.5MB              2.916s
115.5MB/s            313.9MB              2.718s
105.5MB/s            517.4MB              4.904s
114.1MB/s            288.1MB              2.526s
103.3MB/s            555.9MB              5.383s
114.0MB/s            288.0MB              2.526s
109.2MB/s            251.2MB              2.300s
108.0MB/s            291.0MB              2.693s

Average speed: 107.6MB/s
Median speed:  108.0MB/s

Average files created:       150
Average directories created: 119
```

What's going on!? The time it takes to unzip and write to disk is, on average, the same for `c3.large` as `c3.4xlarge`!

### Is Go Any Faster?

I need a break. As mentioned above, the `unzip` command line program is not any better than doing it in Python. But Go is faster right? Right?

Please first accept that I'm not a Go programmer. I can use it to build stuff, but my experience level is quite shallow.

Here's the Go version. This is the critical function that does the unzipping and extraction to disk:

``````
// Imports needed: archive/zip, bytes, io, log, os, path/filepath, strings.
func DumpAndExtract(dest string, buffer []byte, name string) {
	size := int64(len(buffer))
	// zip.NewReader needs an io.ReaderAt plus the total size of the archive.
	zipReader, err := zip.NewReader(bytes.NewReader(buffer), size)
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range zipReader.File {
		rc, err := f.Open()
		if err != nil {
			log.Fatal(err)
		}
		defer rc.Close()
		fpath := filepath.Join(dest, f.Name)
		if f.FileInfo().IsDir() {
			os.MkdirAll(fpath, os.ModePerm)
		} else {
			// Make File
			var fdir string
			if lastIndex := strings.LastIndex(fpath, string(os.PathSeparator)); lastIndex > -1 {
				fdir = fpath[:lastIndex]
			}
			err = os.MkdirAll(fdir, os.ModePerm)
			if err != nil {
				log.Fatal(err)
			}
			f, err := os.OpenFile(
				fpath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode())
			if err != nil {
				log.Fatal(err)
			}
			defer f.Close()

			_, err = io.Copy(f, rc)
			if err != nil {
				log.Fatal(err)
			}
		}
	}
}
``````

And the measurement is done like this:

``````
size := int64(len(content))
t0 := time.Now()
DumpAndExtract(tmpdir, content, filename)
t1 := time.Now()
speed := float64(size) / t1.Sub(t0).Seconds()
``````

It's not as sophisticated (since it's only able to use `/tmp`) but let's just run it and see how it compares to Python:

```
c4.4xlarge$ mkdir ~/GO
c4.4xlarge$ export GOPATH=~/GO
c4.4xlarge$ go get github.com/pyk/byten
c4.4xlarge$ go build unzips.go
c4.4xlarge$ ./unzips /tmp/massive-symbol-zips
56MB/s         407MB          7.27804954
74MB/s         321MB          4.311504933
75MB/s         288MB          3.856798853
75MB/s         292MB          3.90972474
81MB/s         248MB          3.052652168
58MB/s         530MB          9.065985117
59MB/s         554MB          9.35237202
75MB/s         297MB          3.943132388
74MB/s         306MB          4.147176578

Average speed:    70MB/s
Median speed:     81MB/s
```

So... Go is, on average, 40% slower than Python in this scenario. Did not expect that.

### In Conclusion

No conclusion. Only confusion.

I thought this would be a lot clearer and more obvious. Yeah, I know it's crazy to measure two things at the same time (unzip and disk write) but the whole thing started with a very realistic problem that I'm trying to solve. The ultimate question was: will the performance benefit from us moving the web servers from AWS EC2 `c4.large` to `c3.large`? I think the answer is no.

## UPDATE (Nov 30, 2017)

Here's a horrible hack that causes the extraction to always go to `/dev/null`:

``````
import shutil
import zipfile


class DevNullZipFile(zipfile.ZipFile):
    def _extract_member(self, member, targetpath, pwd):
        # member.is_dir() only works in Python 3.6+
        if member.filename[-1] == '/':
            return targetpath
        dest = '/dev/null'
        with self.open(member, pwd=pwd) as source, open(dest, "wb") as target:
            shutil.copyfileobj(source, target)
        return targetpath


def dump_and_extract(root_dir, file_buffer, klass):
    zf = klass(file_buffer)
    zf.extractall(root_dir)
``````

And here's the outcome of running that:

```
c4.4xlarge$ python3 fastest-dumper.py --dev-null /tmp/massive-symbol-zips
170.1MB/s            297.0MB              1.746s
168.6MB/s            306.1MB              1.815s
147.1MB/s            553.6MB              3.765s
132.1MB/s            407.1MB              3.083s
145.6MB/s            529.8MB              3.639s
175.4MB/s            248.2MB              1.415s
163.3MB/s            321.0MB              1.965s
162.1MB/s            292.3MB              1.803s
168.5MB/s            288.0MB              1.709s

Average speed: 159.2MB/s
Median speed:  163.3MB/s

Average files created:       0
Average directories created: 0
```

I ran it a few times to make sure the numbers are stable. They are. This is on the `c4.4xlarge`.

So, the improvement of writing to `/dev/null` instead of the EBS `/tmp` is 15%. Kinda goes to show how much of the total time is spent reading the `ZipInfo` file object.

For the record, the same comparison on the `c3.4xlarge` was 30% improvement when using `/dev/null`.

Also for the record, if I replace that line `shutil.copyfileobj(source, target)` above with `pass`, the average speed goes from 159.2MB/s to 112.8GB/s but that's not a real value of any kind.

## UPDATE (Nov 30, 2017)

Here's the same benchmark using `c5.4xlarge` instead. So, still EBS but...
"3.0 GHz Intel Xeon Platinum processors with new Intel Advanced Vector Extension 512 (AVX-512) instruction set"

Let's run it on this supposedly faster CPU:

```
c5.4xlarge$ python3 fastest-dumper.py /tmp/massive-symbol-zips
165.6MB/s            314.6MB              1.900s
163.3MB/s            287.7MB              1.762s
155.2MB/s            278.6MB              1.795s
140.9MB/s            513.2MB              3.643s
137.4MB/s            556.9MB              4.052s
134.6MB/s            531.0MB              3.946s
165.7MB/s            314.2MB              1.897s
158.1MB/s            301.5MB              1.907s
151.6MB/s            253.8MB              1.674s
146.9MB/s            502.7MB              3.422s
163.7MB/s            288.0MB              1.759s

Average speed: 153.0MB/s
Median speed:  155.2MB/s

Average files created:       150
Average directories created: 119
```

So that is, on average, 10% faster than `c4.4xlarge`.

Is it 10% more expensive? For a 1-year reserved instance, it's \$0.796 versus \$0.68 respectively. I.e. 15% more expensive. In other words, in this context it's 15% more \$\$\$ for 10% more processing power.

## UPDATE (Jan 24, 2018)

I can almost not believe it!

Thank you Oliver, who discovered (see comment below) a glaring mistake in my last conclusion. For reserved instances (which is what we use on my Mozilla production servers) the `c5.4xlarge` is actually cheaper than `c4.4xlarge`. What?!

In my previous update I compared `c4.4xlarge` and `c5.4xlarge` and concluded that `c5.4xlarge` is 10% faster but 15% more expensive. That actually made sense. Fancier servers, more \$\$\$. But it's not like that in the real world. See for yourself:

[Pricing screenshots: `c4.4xlarge` and `c5.4xlarge`]

## Why didn't I know about machma?!

### "machma - Easy parallel execution of commands with live feedback"

This is so cool! https://github.com/fd0/machma

It's a command line program that makes it really easy to run any command line program in parallel. I.e. in separate processes with separate CPUs.
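To give a feel for the idea, here's a minimal Python sketch of fanning one command out over a pool of workers. It says nothing about how machma itself is implemented, and it has none of the live feedback; the `run_parallel` helper is made up for illustration, and only the `{}` placeholder is borrowed from machma's command line.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_parallel(command, items, workers=None):
    """Run `command` once per item, replacing "{}" in each argument with the item.

    Returns a list of (item, exit_code) tuples in input order.
    """
    workers = workers or os.cpu_count() or 1

    def run_one(item):
        argv = [arg.replace("{}", item) for arg in command]
        # Each call spawns its own OS process; the threads only wait on them.
        completed = subprocess.run(argv, capture_output=True)
        return item, completed.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, items))


if __name__ == "__main__":
    # Hypothetical stand-in for a real command: a tiny Python child process.
    command = [sys.executable, "-c", "import sys; print(sys.argv[1])", "{}"]
    results = run_parallel(command, ["a", "b", "c"], workers=2)
    failures = sum(1 for _, code in results if code != 0)
    print(f"processed {len(results)} items ({failures} failures)")
```

The threads here do no real work themselves. They just block on child processes, which is what lets the operating system spread those processes across the available CPUs.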

### Something network bound

Suppose I have a file like this:

```
▶ wc -l urls.txt
30 urls.txt

▶ cat urls.txt | head -n 3
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/wntdll.pdb/D74F79EB1F8D4A45ABCD2F476CCABACC2/wntdll.sym
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/firefox.pdb/448794C699914DB8A8F9B9F88B98D7412/firefox.sym
```

If I wanted to download all of these files with `wget` the traditional way would be:

```
▶ time cat urls.txt | xargs wget -q -P ./downloaded/
cat urls.txt  0.00s user 0.00s system 53% cpu 0.005 total
xargs wget -q -P ./downloaded/  0.07s user 0.24s system 2% cpu 14.913 total

30

```

So it took 15 seconds to download 30 files that total 21MB.

Now, let's do it with `machma` instead:

```
▶ time cat urls.txt | machma -- wget -q -P ./downloaded/ {}
cat urls.txt  0.00s user 0.00s system 55% cpu 0.004 total
machma -- wget -q -P ./downloaded/ {}  0.53s user 0.45s system 12% cpu 7.955 total
```

That uses 8 separate processors (because my laptop has 8 CPUs).
Because 30 / 8 ~= 4, it roughly does 4 iterations.

But note, it took 15 seconds to download 30 files sequentially. That's an average of 0.5s per file. The reason it doesn't take 4 × 0.5 = 2 seconds (instead of 8 seconds) is that it's at the mercy of bad luck, with some of those 30 downloads spiking a bit.

### Something CPU bound

Now let's do something really CPU intensive; Guetzli compression.

```
▶ ls images | wc -l
7

▶ time find images -iname '*.jpg' | xargs -I {} guetzli --quality 85 {} compressed/{}
find images -iname '*.jpg'  0.00s user 0.00s system 40% cpu 0.009 total
xargs -I {} guetzli --quality 85 {} compressed/{}  35.74s user 0.68s system 99% cpu 36.560 total
```

And now the same but with `machma`:

```
▶ time find images -iname '*.jpg' | machma -- guetzli --quality 85 {} compressed/{}

processed 7 items (0 failures) in 0:10
find images -iname '*.jpg'  0.00s user 0.00s system 51% cpu 0.005 total
machma -- guetzli --quality 85 {} compressed/{}  58.47s user 0.91s system 546% cpu 10.857 total
```

Basically, it took only 11 seconds. This time there were fewer images (7) than there were CPUs (8), so the poor computer is doing super intensive CPU (and memory) work across all CPUs at the same time. Run one at a time, each file takes ~5 seconds on average, so it's really interesting that the parallel run took almost double that instead of a total of ~5 seconds.

### In conclusion

Such a handy tool to have around for command line stuff. I haven't looked at its code much but it's almost a shame that the project only has 300+ GitHub stars. Perhaps because it's kinda complete and doesn't need much more work.

Also, if you attempt all the examples above you'll notice that when you use the `... | xargs ...` approach the stdout and stderr are a mess. For `wget`, that's why I used `-q` to silence it a bit. With `machma` you get a really pleasant color coded live output that tells you the state of the queue, possible failures and an ETA.

## Autocompeter is Dead. Long live Autocompeter!

### January 9, 2017 · 0 comments · Python, Web development, Go

About 2 years ago I launched Autocompeter.com. It was two parts:

1) An `autocompeter.js` pure JavaScript solution to add autocomplete to a search input field.
2) A REST API where you can submit titles with an HTTP header key, and a fancy autocomplete search.

## Only Rewrote the Go + Redis part

The second part has now been completely re-written. The server was originally written in Go and used Redis. Now it's Django and ElasticSearch.

The ultimate reason for this was that Redis was, by far, the biggest memory consumer on my shared DigitalOcean server. The way it worked was that every prefix of every word in every title was indexed as a key. For example the words `p`, `pe`, `pet`, `pete`, `peter` and `peter$` are all keys and they point to an array of IDs that you then look up to get the distinct set of titles and their URLs. This makes it really really fast but since Redis doesn't support namespaces or multiple columns, every key needs a prefix of its own for the domain it belongs to. So the hash for `www.peterbe.com` is `eb9f747` so the strings to store are instead `eb9f747p`, `eb9f747pe`, `eb9f747pet`, `eb9f747pete`, `eb9f747peter` and `eb9f747peter$`.
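The key scheme is easy to sketch in Python. This is only an illustration of the idea, not the original Go server code: a plain dict stands in for Redis, the function names are made up, and lowercasing the words is an assumption.

```python
def prefix_keys(word, domain_hash):
    """All keys for one word: every prefix, plus a `$`-terminated key for the
    complete word, each namespaced by the domain hash."""
    word = word.lower()
    prefixes = [word[:i] for i in range(1, len(word) + 1)]
    prefixes.append(word + "$")
    return [domain_hash + prefix for prefix in prefixes]


def index_title(index, domain_hash, title_id, title):
    """Add title_id under every prefix key of every word in the title."""
    for word in title.split():
        for key in prefix_keys(word, domain_hash):
            index.setdefault(key, set()).add(title_id)


if __name__ == "__main__":
    index = {}  # stands in for Redis: key -> set of title IDs
    index_title(index, "eb9f747", 1, "Peter Bengtsson")
    index_title(index, "eb9f747", 2, "Pet projects")
    print(prefix_keys("peter", "eb9f747"))
    print(sorted(index["eb9f747pet"]))  # IDs of titles with a word starting "pet"
    print(len(index))  # number of keys needed for just two short titles
```

Even two short titles produce dozens of keys, which hints at why memory usage grows so quickly with this approach.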

ElasticSearch on the other hand has ALL of this built in deep in Lucene. AND you can filter. So the way it's queried now instead is something like this:

``````
search = TitleDoc.search()
search = search.filter('term', domain=domain.name)
search = search.query(Q('match_phrase', title=request.GET['q']))
search = search.sort('-popularity', '_score')
search = search[:size]
response = search.execute()
...
``````

And here's how the mapping is defined:

``````
from elasticsearch_dsl import (
    DocType,
    Float,
    Text,
    Index,
    analyzer,
    Keyword,
    token_filter,
)

edge_ngram_analyzer = analyzer(
    'edge_ngram_analyzer',
    type='custom',
    tokenizer='standard',
    filter=[
        'lowercase',
        token_filter(
            'edge_ngram_filter', type='edgeNGram',
            min_gram=1, max_gram=20
        )
    ]
)


class TitleDoc(DocType):
    id = Keyword()
    domain = Keyword(required=True)
    url = Keyword(required=True, index=False)
    title = Text(
        required=True,
        analyzer=edge_ngram_analyzer,
        search_analyzer='standard'
    )
    popularity = Float()
    group = Keyword()
``````
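The important detail in that mapping is the asymmetry: titles are broken into edge n-grams at index time, while the search text is analyzed as plain whole words. Here's a rough pure-Python approximation of what that means. It doesn't talk to ElasticSearch at all, the regex is only a crude stand-in for the `standard` tokenizer, and it ignores the word ordering that `match_phrase` also cares about.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Leading substrings of `token`, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(title):
    """Approximate index-time analysis: split into words, lowercase, edge n-grams."""
    terms = set()
    for token in re.findall(r"\w+", title.lower()):
        terms.update(edge_ngrams(token))
    return terms


def search_terms(query):
    """Approximate search-time analysis: lowercased whole words, no n-grams."""
    return re.findall(r"\w+", query.lower())


def matches(title, query):
    """True when every query word is one of the title's indexed terms."""
    indexed = index_terms(title)
    return all(term in indexed for term in search_terms(query))


if __name__ == "__main__":
    print(edge_ngrams("peter"))  # ['p', 'pe', 'pet', 'pete', 'peter']
    print(matches("Peter Bengtsson's blog", "pet"))   # True
    print(matches("Peter Bengtsson's blog", "eter"))  # False
```

Because only the indexed side is expanded into prefixes, typing `pet` finds "Peter" but a fragment from the middle of a word, like `eter`, does not.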

I'm learning ElasticSearch rapidly but I still feel like I have so much to learn. This solution I have here is quite good and I'm pretty happy with the results but I bet there's a lot of things I can learn to make it even better.

## Why Ditch Go?

I actually had a lot of fun building the first server version of Autocompeter in Go but Django is just so many times more convenient. It's got management commands, ORM, authentication system, CSRF protection, awesome error reporting, etc. All built in! With Go I had to build everything from scratch.

Also, I felt like the important thing here is the JavaScript client and the database. Now that I've proven this to work with Django and `elasticsearch-dsl` I think it wouldn't be too hard to re-write the critical query API in Go or in something like Sanic for maximum performance.

## All Dockerized

Oh, one of the reasons I wanted to do this new server in Python is because I want to learn Docker better and in particular Docker with Python projects.

The project is now entirely contained in Docker so you can start the PostgreSQL, ElasticSearch 5.1.1 and Django with `docker-compose up`. There might be a couple of things I've forgotten to document for how to configure things but this is actually the first time I've developed something entirely in Docker.

## Benchmarking Autocompeter

Now that Autocompeter.com is launched I can publish some preliminary benchmarks of "real" usage. It's all on my MacBook Pro on a local network with a local Redis but it's quite telling that it's pretty fast.

What I did was I started with a completely empty Redis database then I did the following things:

First of all, I bulk load in 1035 "documents" (110Kb of data). This takes about 0.44 seconds consistently!

1. GET on the home page (not part of the API and thus quite unimportant in terms of performance)
2. GET on a search with a single character ("p") expecting 10 results (e.g. `/v1?d=mydomain&q=p`)
3. GET on a search with a full word ("python") expecting 10 results
4. GET on a search with a full word that isn't in the index ("xxxxxxxx") expecting 0 results
5. GET on a search with two words ("python", "te") expecting 4 results
6. GET on a search with two words that aren't in the index ("xxxxxxx", "yyyyyy") expecting 0 results

In each benchmark I use wrk with 10 connections, lasting 5 seconds, using 5 threads.

And for each round I try with 1 processor, 2 processors and 8 processors (my laptop's max according to `runtime.NumCPU()`).

I ran it a bunch of times and recorded the last results for each number of processors.
The results are as follows:

Notes

• Every search incurs a write in the form of incrementing a counter.
• Searching on more than one word causes a ZINTERSTORE.
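To make those two notes concrete, here's a small in-memory Python sketch of what a search does. Everything in it is an assumption for illustration: plain dicts stand in for Redis sorted sets, the counter is keyed by domain, and the function names are invented. The one thing taken from Redis itself is that ZINTERSTORE, by default, sums the scores of members present in every input set.

```python
from collections import Counter


def zinterstore(*sorted_sets):
    """Intersect {member: score} dicts, summing scores like ZINTERSTORE's default."""
    common = set(sorted_sets[0]).intersection(*sorted_sets[1:])
    return {member: sum(s[member] for s in sorted_sets) for member in common}


def search(index, counters, domain, words, limit=10):
    """Return up to `limit` member IDs matching all `words`, best score first."""
    counters[domain] += 1  # the write that every search incurs
    sets = [index.get(word, {}) for word in words]
    if not sets:
        return []
    # A single word can be read straight from its set; several words need
    # the intersection step first.
    hits = sets[0] if len(sets) == 1 else zinterstore(*sets)
    return sorted(hits, key=hits.get, reverse=True)[:limit]


if __name__ == "__main__":
    index = {
        "python": {1: 3.0, 2: 1.0},
        "te": {2: 5.0, 3: 1.0},
    }
    counters = Counter()
    print(search(index, counters, "mydomain", ["python"]))
    print(search(index, counters, "mydomain", ["python", "te"]))
    print(counters["mydomain"])
```

In real Redis the intersection result is itself written to a new key, so a multi-word search costs an extra write on top of the counter increment.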