Some notes on my changes to mythmon's code. First of all, I didn't make any changes to the single_threaded code. The only speedup there came from tweaking the optimizer settings.
Changing the release build settings to enable LTO is one of the simplest things to try when improving the performance of Rust code. LTO (link-time optimization) is an LLVM feature that lets the optimizer work across compilation-unit boundaries, and it can often provide performance and code-size wins at the expense of slightly longer build times:
https://github.com/luser/rust-gz-csv-test/commit/952b2e503e9301c7a6b3ba9dcfc264f117191713#diff-80398c5faae3c069e4e6aa2ed11b28c0R25
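For reference, turning on LTO for release builds is a one-line addition to `Cargo.toml`. A minimal sketch of the relevant profile section (the linked commit is the authoritative diff):

```toml
# Cargo.toml: enable link-time optimization for release builds.
[profile.release]
lto = true
```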
I should have profiled the binary at this point, but I jumped ahead and assumed I could speed things up by not using the csv crate's `Reader::deserialize` method. serde is amazing and makes for very readable code, but since the `Row` struct has `String` members, each of those costs a memory allocation for every row. I used the `Reader::read_byte_record` method instead, which does a very simple parse of each CSV line and gives you the raw bytes of each field, and then parsed only the date field, since that was the only field being used:
https://github.com/luser/rust-gz-csv-test/commit/952b2e503e9301c7a6b3ba9dcfc264f117191713#diff-5d8025f9232930e1be589edfd0704015R62
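Here's a minimal sketch of that pattern, assuming the date is the first column (the real code follows the file's actual schema). Only the date field is ever converted from raw bytes; the other fields are never copied into `String`s:

```rust
use csv::{ByteRecord, ReaderBuilder};

// Count rows with a valid UTF-8 date field, without deserializing rows
// into an owned struct. `read_byte_record` reuses one buffer, so the
// loop does no per-row allocations.
fn count_dated_rows(data: &[u8]) -> csv::Result<u64> {
    let mut rdr = ReaderBuilder::new().from_reader(data);
    let mut record = ByteRecord::new();
    let mut count = 0u64;
    while rdr.read_byte_record(&mut record)? {
        // Column 0 is assumed to hold the date; everything else stays
        // as untouched raw bytes.
        if let Some(date_bytes) = record.get(0) {
            if std::str::from_utf8(date_bytes).is_ok() {
                count += 1;
            }
        }
    }
    Ok(count)
}
```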
After that change I did profile the binary (using `perf record` on Linux) and found that the hottest function was the DateTime parsing code in the chrono crate. I looked around for some other options, but I couldn't find anything that fit the bill, so I used the nom crate to write a very simple RFC 3339 DateTime parser:
https://github.com/luser/rust-gz-csv-test/commit/43ee6c932c5773674fec5ede89c6749b0e0d2e60#diff-5d8025f9232930e1be589edfd0704015R31
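To give a flavor of the approach without reproducing the nom version exactly, here's a hand-rolled sketch of the same fixed-format idea: parse the `YYYY-MM-DD` prefix of an RFC 3339 timestamp by byte offset and reject anything else (the linked commit does this with nom combinators instead):

```rust
// Parse the "YYYY-MM-DD" date prefix of an RFC 3339 timestamp.
// Returns None on any input that doesn't match the fixed layout.
fn parse_date_prefix(s: &[u8]) -> Option<(u16, u8, u8)> {
    if s.len() < 10 || s[4] != b'-' || s[7] != b'-' {
        return None;
    }
    // Fold ASCII digits into a number, failing on any non-digit byte.
    fn digits(b: &[u8]) -> Option<u32> {
        b.iter().try_fold(0u32, |acc, &c| {
            c.is_ascii_digit().then(|| acc * 10 + u32::from(c - b'0'))
        })
    }
    let year = digits(&s[0..4])? as u16;
    let month = digits(&s[5..7])? as u8;
    let day = digits(&s[8..10])? as u8;
    Some((year, month, day))
}
```

For example, `parse_date_prefix(b"2018-01-15T00:00:00Z")` returns `Some((2018, 1, 15))`, while anything with the dashes in the wrong place returns `None`.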
I'm sure the parser isn't entirely spec-compliant, and I wouldn't use it in production code that had to accept arbitrary input, but it worked well enough for this constrained case, and nom makes parsers like this extremely easy to write, so it wasn't that bad! After implementing that I re-profiled the binary and found that the hottest functions were doing gzip decompression and the actual core CSV parsing, which feels entirely reasonable and like a good place to stop.
As an aside, I did take a step back and rewrite the code to use serde again, but with borrowed data (`&str`) instead of owned data (`String`), which avoids the allocation overhead while still using my custom DateTime parser:
https://github.com/luser/rust-gz-csv-test/commit/09d0f01adc02ca43968d7dfdf4c17a16731f6165
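A minimal sketch of what that looks like; the field names here are made up, and the real version runs the date through the custom parser rather than leaving it as a `&str`. The key detail is that csv supports borrowed deserialization via `StringRecord::deserialize`, so the `&str` fields point into the record's buffer:

```rust
use serde::Deserialize;

// Borrowed row: &str fields point into the StringRecord's buffer, so
// deserializing a row allocates nothing.
#[derive(Deserialize)]
struct Row<'a> {
    date: &'a str,
    name: &'a str,
    value: &'a str,
}

fn count_rows(data: &[u8]) -> csv::Result<u64> {
    let mut rdr = csv::ReaderBuilder::new().from_reader(data);
    let headers = rdr.headers()?.clone();
    let mut record = csv::StringRecord::new();
    let mut count = 0u64;
    while rdr.read_record(&mut record)? {
        let row: Row = record.deserialize(Some(&headers))?;
        // The real code would hand row.date to the custom DateTime
        // parser here.
        let _ = row.date;
        count += 1;
    }
    Ok(count)
}
```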
This was slightly slower than my fastest version, but still considerably faster than most other versions, and it seems more readable than the fastest version I wrote. If I had to support this code for real, this is probably the version I would choose, unless that little bit of extra performance were actually important.