Using SIMD for Fun

Recently I heard about the Grille problem. This is a very simple problem with a O(n) solution, no problem. But how fast can we solve it? Some people were competing with a 100MB input and having heard about SIMD I thought I would give it a shot. Solve the problem yourself and then we'll go over a simple SIMD solution.


Grille problem


SIMD

SIMD stands for Single-Instruction Multiple-Data. This means that with a single instruction we can modify multiple pieces of data; with AVX2 up to 32 bytes at once. As you might imagine, this can speed things up significantly. Note that my code is in Rust but is very easily translatable to C.


First we need to load our grille data into a SIMD register.


1let dat = _mm256_loadu_si256(grille.as_ptr() as *const __m256i);

We now have 32 bytes of the grille loaded. We want to find which ones are spaces so first we need to load another vector of the characters that we want to compare to, in this case all spaces:


1let mask = _mm256_set1_epi8(b' ' as i8);

Then we perform the comparison:


1let eq = _mm256_cmpeq_epi8(mask, dat);

cmpeq_epi8 compares each byte in the two vectors and if they are equal outputs 0xff, otherwise 0. Now eq is a list of 32 booleans. SIMD operations are somewhat limited so at this point we convert back to normal workable data:


1let mut pos = _mm256_movemask_epi8(eq) as u32;

movemask_epi8 converts our boolean vector to a 32-bit integer where each bit corresponds to a byte in our vector. Each bit is set if its corresponding byte was true.


Now we have a 32-bit integer where each bit represents an index into our grid. If the bit is set we want to grab the corresponding character from the grid and output it. The naive way is to check if pos & 1 == 1 for each bit, but we can do better. Instead we count the number of trailing zeros.


1let x = pos.trailing_zeros();

Trailing zeros uses the TZCNT instruction, part of SSE4. It gives us the number of zeros until the first 1, which is the same as giving us the index of the first set bit. This index is the index of the first character we need to output. Then we simply unset that bit with


1pos = pos ^ (1 << x);

and continue until pos == 0. This way we check no unnecessary bits.


Results

My naive version took 0.45 seconds on my machine. This SIMD version took 0.14 seconds. That might not seem like a ton but consider running it 100 or 1000 times. Big Data is always being talked about and this type of speedup saves not only time, but a lot of money on cloud services. I actually found that this code was faster than using memchr so don't think that you can't beat off-the-shelf tools. I was very satisfied with this for a first try at using SIMD.


There are actually even faster ways to do this. After I posted my solution, someone else wrote a C solution that does even more SIMD processing. In addition, with AVX512 there is an instruction, _mm256_mask_compressstoreu_epi8 which allows you to write only the needed grid data with a single instruction, removing the need for our loop. AVX512 also allows you to operate on up to 64 bytes at once. Unfortunately very few processors support it and I've heard that it can be slow.


Shameless Advert

I am looking for work. If you liked this article, consider hiring me (preferably remote). You can view my (slightly redacted) resume here. If I don't find work soon, I will be faced with the choice of starving to death or giving up on programming FOREVER and going back to work as a janitor. Is that what you want? I have many fun things coming soon that you don't want to miss. I can even invert a binary tree!