Recently I heard about the Grille problem. This is a very simple problem with a O(n) solution, no problem. But how fast can we solve it? Some people were competing with a 100MB input and having heard about SIMD I thought I would give it a shot. Solve the problem yourself and then we'll go over a simple SIMD solution.
SIMD stands for Single-Instruction Multiple-Data. This means that with a single instruction we can modify multiple pieces of data; with AVX2 up to 32 bytes at once. As you might imagine, this can speed things up significantly. Note that my code is in Rust but is very easily translatable to C.
First we need to load our grille data into a SIMD register.
1 | let dat = _mm256_loadu_si256(grille.as_ptr() as *const __m256i); |
We now have 32 bytes of the grille loaded. We want to find which ones are spaces so first we need to load another vector of the characters that we want to compare to, in this case all spaces:
1 | let mask = _mm256_set1_epi8(b' ' as i8); |
Then we perform the comparison:
1 | let eq = _mm256_cmpeq_epi8(mask, dat); |
cmpeq_epi8
compares each byte in the two vectors and
if they are equal outputs 0xff, otherwise 0. Now eq
is a list of 32 booleans. SIMD operations are somewhat limited so
at this point we convert back to normal workable data:
1 | let mut pos = _mm256_movemask_epi8(eq) as u32; |
movemask_epi8
converts our boolean vector to a 32-bit
integer where each bit corresponds to a byte in our vector. Each bit
is set if its corresponding byte was true.
Now we have a 32-bit integer where each bit represents an index into
our grid. If the bit is set we want to grab the corresponding character
from the grid and output it. The naive way is to check if pos & 1 == 1
for each bit, but we can do better. Instead we count the number of trailing zeros.
1 | let x = pos.trailing_zeros(); |
Trailing zeros uses the TZCNT
instruction, part of SSE4.
It gives us the number of zeros until the first 1, which is the same
as giving us the index of the first set bit. This index is the index
of the first character we need to output. Then we simply unset that
bit with
1 | pos = pos ^ (1 << x); |
and continue until pos == 0
. This way we check no unnecessary
bits.
My naive version took 0.45 seconds on my machine. This SIMD version took 0.14 seconds. That might not seem like a ton but consider running it 100 or 1000 times. Big Data is always being talked about and this type of speedup saves not only time, but a lot of money on cloud services. I actually found that this code was faster than using memchr so don't think that you can't beat off-the-shelf tools. I was very satisfied with this for a first try at using SIMD.
There are actually even faster ways to do this. After I posted my solution,
someone else wrote a C solution that does even more SIMD processing. In
addition, with AVX512 there is an instruction, _mm256_mask_compressstoreu_epi8
which allows you to write only the needed grid data with a single instruction,
removing the need for our loop. AVX512 also allows you to operate on up
to 64 bytes at once. Unfortunately very few processors support it and
I've heard that it can be slow.
I am looking for work. If you liked this article, consider hiring me (preferably remote). You can view my (slightly redacted) resume here. If I don't find work soon, I will be faced with the choice of starving to death or giving up on programming FOREVER and going back to work as a janitor. Is that what you want? I have many fun things coming soon that you don't want to miss. I can even invert a binary tree!