Beating your CLI tools with 30 lines of Rust

You may remember the old article Command-line Tools can be 235x Faster than your Hadoop Cluster. Recently I was thinking about parallelizing a piece of software in Rust, so I revisited the article and wondered whether Rust could easily match or improve on its speed. With 30 lines of code I was able to outperform the Bash pipeline that Adam Drake constructed, ending up 1.5x faster. For reference, my CPU is a Ryzen 5900X. Here is the original pipeline, reproduced:


find . -type f -name '*.pgn' -print0 | xargs -0 -n4 -P4 mawk '/Result/ { split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++ } END { print white+black+draw, white, black, draw }' | mawk '{games += $1; white += $2; black += $3; draw += $4; } END { print games, white, black, draw }'

Make sure that you change the arguments to xargs to match the number of cores in your machine (who would forget that? Couldn't be me).


I don't have home Internet, so I was unable to use real data. Instead I took the snippet below from the article and used it to generate 300 files of 100,000 games each, with a random result for each game. This amounted to 4.9GB of data to process.


[Event "F/S Return Match"]
[Site "Belgrade, Serbia Yugoslavia|JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]

The real data would contain many more lines (the moves of each game) that we don't care about and that would be filtered out. I don't believe this affects the results, but someone with an Internet connection can try it on real data and berate me if I'm wrong.
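If you want to reproduce this, a generator along these lines will produce equivalent data. This is a sketch, not my exact code: the file names are made up, and I've swapped proper randomness for a cheap cycle through the three results. Note that it emits no blank lines between games, since the analyzer below unconditionally slices the first 8 bytes of every line.


use std::fs::{self, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Cycle through the three results instead of pulling in the `rand`
    // crate; the analysis only cares that all three values appear.
    let results = ["1-0", "0-1", "1/2-1/2"];
    fs::create_dir_all("data")?;
    for i in 0..300 {
        let mut w = BufWriter::new(File::create(format!("data/games_{i}.pgn"))?);
        for g in 0..100_000usize {
            writeln!(w, "[Event \"F/S Return Match\"]")?;
            writeln!(w, "[Site \"Belgrade, Serbia Yugoslavia|JUG\"]")?;
            writeln!(w, "[Date \"1992.11.04\"]")?;
            writeln!(w, "[Round \"29\"]")?;
            writeln!(w, "[White \"Fischer, Robert J.\"]")?;
            writeln!(w, "[Black \"Spassky, Boris V.\"]")?;
            writeln!(w, "[Result \"{}\"]", results[g % 3])?;
        }
    }
    Ok(())
}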


Without further ado, here is the Rust code in all of its glory.


extern crate rayon;

use std::fs;
use std::io::{BufRead, BufReader};
use std::sync::atomic::{AtomicUsize, Ordering};

use rayon::prelude::*;

fn main() {
    let white = AtomicUsize::new(0);
    let black = AtomicUsize::new(0);
    let draw = AtomicUsize::new(0);
    // Process the files in parallel, and the lines within each file in parallel too.
    fs::read_dir("data").unwrap().par_bridge().for_each(|entry| {
        let f = fs::File::open(entry.unwrap().path()).unwrap();
        let r = BufReader::new(f);
        r.lines()
            .par_bridge()
            // Keep only the "[Result ..." tag lines.
            // NB: the slice panics on lines shorter than 8 bytes.
            .filter_map(|line| {
                let l = line.unwrap();
                if &l.as_bytes()[0..8] == b"[Result " {
                    Some(l)
                } else {
                    None
                }
            })
            // The byte just before the '-' separator encodes the result:
            // "1-0", "0-1", or "1/2-1/2".
            .for_each(|l| {
                let l = l.split('-').next().unwrap().as_bytes();
                match l[l.len() - 1] {
                    b'0' => black.fetch_add(1, Ordering::Relaxed),
                    b'1' => white.fetch_add(1, Ordering::Relaxed),
                    b'2' => draw.fetch_add(1, Ordering::Relaxed),
                    _ => unreachable!(),
                };
            });
    });
    let white = white.into_inner();
    let black = black.into_inner();
    let draw = draw.into_inner();
    println!("{}, {}, {}, {}", white + black + draw, white, black, draw);
}

It's really straightforward; I did no optimization, I just threw the code into my editor (Neovim btw). It does things the same way as the Bash pipeline: grab a list of files, read them and keep the Result lines, split those lines at the hyphen, match on the character just before it, and increment the corresponding counter. The one optimization I did make as I was writing is here:


if &l.as_bytes()[0..8] == b"[Result " {

I noticed that [Result (with the trailing space) is 8 bytes, meaning it fits exactly into a u64. That makes this comparison a single instruction instead of a byte-by-byte loop. Actually, it seems Rust is smart enough to do this even without converting to a byte slice, but that version involved more instructions and a jump. (Also, a reminder that byte-slicing a string like this can panic in Rust: an out-of-range index, or one that doesn't land on a character boundary, will do it.)
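If you wanted to do this by hand rather than trusting the optimizer, the same trick looks roughly like the function below. This is just an illustration of what the compiler is doing; the actual code above relies on it happening automatically.


// Hand-written version of what the optimizer emits: load the first 8 bytes
// as a u64 and compare once, instead of looping byte-by-byte.
fn is_result_line(l: &[u8]) -> bool {
    l.len() >= 8
        && u64::from_ne_bytes(l[..8].try_into().unwrap())
            == u64::from_ne_bytes(*b"[Result ")
}


Note the explicit length check, which also avoids the panic that the slice version has on lines shorter than 8 bytes.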


Rayon does all of the heavy lifting here. In another language I think I would have had a much harder time.
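If you haven't used Rayon, par_bridge is the whole trick: it takes an ordinary sequential Iterator and spreads its items across Rayon's thread pool. A toy example:


use rayon::prelude::*;

fn main() {
    // Any Send-able sequential iterator can be bridged into a parallel one;
    // items are distributed across Rayon's global thread pool.
    let sum_of_squares: u64 = (1..=1_000_000u64)
        .par_bridge()
        .map(|n| n * n)
        .sum();
    println!("{sum_of_squares}");
}


(For a plain range you'd normally reach for into_par_iter, which splits the work more evenly; par_bridge is for iterators, like lines() over a file, that have no natively parallel equivalent.)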


So, what should you use for this type of analysis? As always, it depends on what you're doing and what you know. This took me about 5 minutes to write, but I know Rust pretty well; building a Bash pipeline would have taken me much longer (skill issue). And while the Rust version was 1.5x faster, that was the difference between 1.242 seconds and 0.821 seconds. For a one-off run it doesn't matter at all, but if this computation were being performed regularly, it could be a big gain. Either way, I think this is a good reminder to be aware of what tools are at your disposal. I could easily run into a problem where using Rayon would be too difficult or would require much more code than a few CLI tools.


I spent a little time trying to optimize this, including attempting SIMD, but the only things that helped were switching to Vec<u8> instead of String and removing the per-line parallelization inside each file. This brought things down to ~0.550 seconds, or 2.25x faster than the shell pipeline.
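Roughly, that version looked like the sketch below (reconstructed from the two changes just described, not the exact code): read_until into a reused Vec<u8> skips both UTF-8 validation and the per-line String allocation, and each file is processed sequentially on its own thread.


use std::fs::{self, File};
use std::io::{BufRead, BufReader};
use std::sync::atomic::{AtomicUsize, Ordering};

use rayon::prelude::*;

fn main() {
    let white = AtomicUsize::new(0);
    let black = AtomicUsize::new(0);
    let draw = AtomicUsize::new(0);
    // Parallelize across files only; each file is read sequentially.
    fs::read_dir("data").unwrap().par_bridge().for_each(|entry| {
        let mut r = BufReader::new(File::open(entry.unwrap().path()).unwrap());
        // Reused byte buffer: no UTF-8 validation, no per-line allocation.
        let mut buf = Vec::new();
        while r.read_until(b'\n', &mut buf).unwrap() > 0 {
            if buf.starts_with(b"[Result ") {
                // The byte just before the first '-' encodes the result.
                let end = buf.iter().position(|&b| b == b'-').unwrap();
                match buf[end - 1] {
                    b'0' => black.fetch_add(1, Ordering::Relaxed),
                    b'1' => white.fetch_add(1, Ordering::Relaxed),
                    b'2' => draw.fetch_add(1, Ordering::Relaxed),
                    _ => unreachable!(),
                };
            }
            buf.clear();
        }
    });
    let (white, black, draw) = (white.into_inner(), black.into_inner(), draw.into_inner());
    println!("{}, {}, {}, {}", white + black + draw, white, black, draw);
}


Dropping the inner par_bridge makes sense in hindsight: with 300 files there's already plenty of coarse-grained work for every core, and shoving individual lines through Rayon's queues adds overhead.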


As an aside, I think that if you're developing a new language, having something like Rayon built in would be a huge boon. Imagine a language as approachable as Python that could easily, even automagically, parallelize your data analysis. I think people turn to Bash not because it's good, but because it's the easiest option for streaming programs. A language that's just as capable but has a consistent syntax could be a huge win here.


Shameless Advert

I am looking for work. If you liked this article, consider hiring me (preferably remote). You can view my (slightly redacted) resume here. If I don't find work soon, I will be faced with the choice of starving to death or giving up on programming FOREVER and going back to work as a janitor. Is that what you want? I have many fun things coming soon that you don't want to miss. I can even invert a binary tree!