Avoid async when all you have is (SSD) disk I/O in NodeJS

24 October 2019   0 comments   Javascript, Node

tl;dr; If you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

I'm not a NodeJS expert so I could really do with some eyes on this.

There is little doubt in my mind that it's smart to use asynchronous ideas when your program has to wait for network I/O. Because network I/O is slow, it's better to let your program work on something else whilst waiting. But disk is actually fast. Especially if you have SSD disk.

The context

I'm working on a Node program that walks a large directory structure and looks for certain file patterns, reads those files, does some processing and then exits. It's a cli basically and it's supposed to work similar to jest where you tell it to go and process files and if everything worked, exit with 0 and if anything failed, exit with something >0. Also, it needs to be possible to run it so that it exits immediately on the first error encountered. This is similar to running jest --bail.

My program needs to process thousands of files and although there are thousands of files, they're all relatively small. So first I wrote a simple reference program: https://github.com/peterbe/megafileprocessing/blob/master/reference.js
What it does is that it walks a directory looking for certain .json files that have certain keys that it knows about. Then, just computes the size of the values and tallies that up. My real program will be very similar except it does a lot more with each .json file.

You run it like this:

CHAOS_MONKEY=0.001 node reference.js ~/stumptown-content/kumadocs -q
Error: Chaos Monkey!
    at processDoc (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:37:11)
    at /Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:80:21
    at Array.forEach (<anonymous>)
    at main (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:78:9)
    at Object.<anonymous> (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:99:20)
    at Module._compile (internal/modules/cjs/loader.js:956:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:973:10)
    at Module.load (internal/modules/cjs/loader.js:812:32)
    at Function.Module._load (internal/modules/cjs/loader.js:724:14)
    at Function.Module.runMain (internal/modules/cjs/loader.js:1025:10)
Total length for 4057 files is 153953645
1 files failed.

(The environment variable CHAOS_MONKEY=0.001 makes it so there's a 0.1% chance it throws an error)

It processed 4,057 files and one of those failed (thanks to the "chaos monkey").
In its current state that (on my MacBook) that takes about 1 second.

It's not perfect but it's a good skeleton. Everything is synchronous. E.g.

function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;
  files.forEach(file => {
    try {
      const total = processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  });
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

And inside the processDoc function it used const content = fs.readFileSync(fspath, "utf8");.

I/Os compared

@amejiarosario has a great blog post called "What every programmer should know about Synchronous vs. Asynchronous Code". In it, he has this great bar chart:

Latency vs. System Event

If you compare "SSD I/O" with "Network SFO/NCY" the difference is that SSD I/O is 456 times "faster" than SFO-to-NYC network I/O. I.e. the latency is 456 times less.

Another important aspect when processing lots of files is garbage collection. When running synchronous, it can garbage collect as soon as it has processed one file before moving on to the next. If it was asynchronous, as soon as it yields to move on to the next file, it might hold on to memory from the first file. Why does this matter? Because if the memory-usage when processing many files asynchronously bloat so hard that it actually crashes with an out-of-memory error. So what matters is avoiding that. It's OK if the program can use lots of memory if it needs to, but it's really bad if it crashes.

One way to measure this is to use /usr/bin/time -l (at least that's what it's called on macOS). For example:

▶ /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        0.75 real         0.58 user         0.23 sys
  57221120  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     64160  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      1074  involuntary context switches

Its maximum memory usage total was 57221120 bytes (55MB) in this example.

Introduce asynchronous file reading

Let's change the reference implementation to use const content = await fsPromises.readFile(fspath, "utf8");. We're still using files.forEach(file => { but within the loop the whole function is prefixed with async function main() { now. Like this:

async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let total;
  for (let file of files) {
    try {
      total = await processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  }
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

Let's see how it works:

▶ /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        1.31 real         1.01 user         0.49 sys
  68898816  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     68107  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     62562  involuntary context switches

That means it maxed out at 68898816 bytes (65MB).

You can already see a difference. 0.79 seconds and 55MB for synchronous and 1.31 seconds and 65MB for asynchronous.

But to really measure this, I wrote a simple Python program that runs this repeatedly and reports a min/median on time and max on memory:

▶ python3 wrap_time.py /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   0.74s
WORST:  0.84s
MEAN:   0.78s
MEDIAN: 0.78s
MAX MEMORY
BEST:   53.5MB
WORST:  55.3MB
MEAN:   54.6MB
MEDIAN: 54.8MB

And for the asynchronous version:

▶ python3 wrap_time.py /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.28s
WORST:  1.82s
MEAN:   1.39s
MEDIAN: 1.31s
MAX MEMORY
BEST:   65.4MB
WORST:  67.7MB
MEAN:   66.7MB
MEDIAN: 66.9MB

Promise.all version

I don't know if the async1.js is realistic. More realistically you'll want to not wait for one file to be processed (asynchronously) but start them all at the same time. So I made a variation of the asynchronous version that looks like this instead:

async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let values;
  values = await Promise.all(
    files.map(async file => {
      try {
        total = await processDoc(file, quiet);
        !quiet && console.log(`${file} is ${total}`);
        return total;
      } catch (err) {
        if (bail) {
          console.error(err);
          process.exit(1);
        } else {
          console.error(err);
          errors++;
        }
      }
    })
  );
  totalTotal = values.filter(n => n).reduce((a, b) => a + b);
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
    throw new Error("More than 0 errors");
  }
}

You can see the whole file here: async2.js

The key difference is that it uses await Promise.all(files.map(...)) instead of for (let file of files) {.
Also, to accomplish the ability to bail on the first possible error it needs to use process.exit(1); within the callbacks. Not sure if that's right but from the outside, you get the desired effect as a cli program. Let's measure it too:

▶ python3 wrap_time.py /usr/bin/time -l node async2.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.44s
WORST:  1.61s
MEAN:   1.52s
MEDIAN: 1.52s
MAX MEMORY
BEST:   434.0MB
WORST:  460.2MB
MEAN:   453.4MB
MEDIAN: 456.4MB

Note how this uses almost 10x max. memory. That's dangerous if the processing is really memory hungry individually.

When asynchronous is right

In all of this, I'm assuming that the individual files are small. (Roughly, each file in my experiment is about 50KB)
What if the files it needs to read from disk are large?

As a simple experiment read /users/peterbe/Downloads/Keybase.dmg 20 times and just report its size:

for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

See the simple-async.js here. Basically it's this:

for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

Results are:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async.js
...
TIMES
BEST:   0.84s
WORST:  4.32s
MEAN:   1.33s
MEDIAN: 0.97s
MAX MEMORY
BEST:   1851.1MB
WORST:  3079.3MB
MEAN:   2956.3MB
MEDIAN: 3079.1MB

And the equivalent synchronous simple-sync.js here.

for (let x = 0; x < 20; x++) {
  const largeFile = fs.readFileSync("/users/peterbe/Downloads/Keybase.dmg");
  console.log(`File size#${x}: ${Math.round(largeFile.length / 1e6)} MB`);
}

It performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-sync.js
...
TIMES
BEST:   1.97s
WORST:  2.74s
MEAN:   2.27s
MEDIAN: 2.18s
MAX MEMORY
BEST:   1089.2MB
WORST:  1089.7MB
MEAN:   1089.5MB
MEDIAN: 1089.5MB

So, almost 2x as slow but 3x as much max. memory.

Lastly, instead of an iterative loop, let's start 20 readers at the same time (simple-async2.js):

Promise.all(
  [...Array(20).fill()].map((_, x) => {
    return fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
      if (err) throw err;
      console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
    });
  })
);

And it performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async2.js
...
TIMES
BEST:   0.86s
WORST:  1.09s
MEAN:   0.96s
MEDIAN: 0.94s
MAX MEMORY
BEST:   3079.0MB
WORST:  3079.4MB
MEAN:   3079.2MB
MEDIAN: 3079.2MB

So quite naturally, the same total time as the simple async version but uses 3x max. memory every time.

Ergonomics

I'm starting to get pretty comfortable with using promises and async/await. But I definitely feel more comfortable without. Synchronous programs read better from an ergonomics point of view. The async/await stuff is just Promises under the hood and it's definitely an improvement but the synchronous versions just have a simpler "feeling" to it.

Conclusion

I don't think it's a surprise that the overhead of event switching adds more time than its worth when the individual waits aren't too painful.

A major flaw with synchronous programs is that they rely on the assumption that there's no really slow I/O. So what if the program grows and morphs so that it someday does depend on network I/O then your synchronous program is "screwed" since an asynchronous version would run circles around it.

The general conclusion is; if you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

Comments

Your email will never ever be published

Related posts

Previous:
Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON 30 September 2019
Next:
MDN Documents Size Tree Map 14 November 2019
Related by Keyword:
To defer or to async JavaScript tags. That's the question. 29 June 2018
Unzip benchmark on AWS EC2 c3.large vs c4.large 29 November 2017
Never put external Javascript in the <head> 02 April 2013
How I stopped worrying about IO blocking Tornado 18 September 2012
Persistent caching with fire-and-forget updates 14 December 2011