Why didn't I know about machma?!

07 June 2017   0 comments   Linux, MacOSX, Go

"machma - Easy parallel execution of commands with live feedback"

This is so cool! https://github.com/fd0/machma

It's a command line program that makes it really easy to run any command line program in parallel. I.e. in separate processes with separate CPUs.

Something network bound

Suppose I have a file like this:

▶ wc -l urls.txt
      30 urls.txt

▶ cat urls.txt | head -n 3

If I wanted to download all of these files with wget the traditional way would be:

▶ time cat urls.txt | xargs wget -q -P ./downloaded/
cat urls.txt  0.00s user 0.00s system 53% cpu 0.005 total
xargs wget -q -P ./downloaded/  0.07s user 0.24s system 2% cpu 14.913 total

▶ ls downloaded | wc -l

▶ du -sh downloaded
 21M    downloaded

So it took 15 seconds to download 30 files that totals 21MB.

Now, let's do it with machama instead:

▶ time cat urls.txt | machma -- wget -q -P ./downloaded/ {}
cat urls.txt  0.00s user 0.00s system 55% cpu 0.004 total
machma -- wget -q -P ./downloaded/ {}  0.53s user 0.45s system 12% cpu 7.955 total

That uses 8 separate processors (because my laptop has 8 CPUs).
Because 30 / 8 ~= 4, it roughly does 4 iterations.

But note, it took 15 seconds to download 30 files synchronously. That's an average of 0.5s per file. The reason it doesn't take 4x0.5 seconds (instead of 8 seconds) is because it's at the mercy of bad luck and some of those 30 spiking a bit.

Something CPU bound

Now let's do something really CPU intensive; Guetzli compression.

▶ ls images | wc -l

▶ time find images -iname '*.jpg' | xargs -I {} guetzli --quality 85 {} compressed/{}
find images -iname '*.jpg'  0.00s user 0.00s system 40% cpu 0.009 total
xargs -I {} guetzli --quality 85 {} compressed/{}  35.74s user 0.68s system 99% cpu 36.560 total

And now the same but with machma:

▶ time find images -iname '*.jpg' | machma -- guetzli --quality 85 {} compressed/{}

processed 7 items (0 failures) in 0:10
find images -iname '*.jpg'  0.00s user 0.00s system 51% cpu 0.005 total
machma -- guetzli --quality 85 {} compressed/{}  58.47s user 0.91s system 546% cpu 10.857 total

Basically, it took only 11 seconds. This time there were fewer images (7) than there was CPUs (8), so basically the poor computer is doing super intensive CPU (and memory) work across all CPUs at the same time. The average time for each of these files is ~5 seconds so it's really interesting that even if you try to do this in parallel execution instead of taking a total of ~5 seconds, it took almost double that.

In conclusion

Such a handy tool to have around for command line stuff. I haven't looked at its code much but it's almost a shame that the project only has 300+ GitHub stars. Perhaps because it's kinda complete and doesn't need much more work.

Also, if you attempt all the examples above you'll notice that when you use the ... | xargs ... approach the stdout and stderr is a mess. For wget, that's why I used -q to silence it a bit. With machma you get a really pleasant color coded live output that tells you the state of the queue, possible failures and an ETA.


Your email will never ever be published

Related posts

Experimenting with Guetzli 24 May 2017
Fastest way to find out if a file exists in S3 (with boto3) 16 June 2017
Related by Keyword:
The ideal number of workers in Jest 08 October 2018
Concurrent Gzip in Python 13 October 2017
Experimenting with Guetzli 24 May 2017
Time to do concurrent CPU bound work 13 May 2016
Decorated Concurrency - Python multiprocessing made really really easy 13 May 2016
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! Peterbe.com has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
Ever wondered how much $87 Billion is? 04 November 2003
Guake, not Yakuake or Yeahconsole 23 January 2010