> I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.
The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.
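To make the loop-invariant point concrete, here is a minimal sketch. It assumes the task is counting bytes equal to some target in a string; the function names and signatures are hypothetical and are not taken from the code under discussion. The first loop exits on data it loads inside the loop, so the trip count is unknown when the loop starts. The second takes the length up front.

```c
#include <stddef.h>

/* Hypothetical: the exit test depends on each loaded byte, so the
   trip count is not known before the loop runs. */
size_t count_sentinel(const char *str, char target)
{
    size_t n = 0;
    for (; *str != '\0'; str++)
        n += (*str == target);
    return n;
}

/* Hypothetical: `len` is fixed before the loop starts (loop invariant),
   which is the shape autovectorizers generally handle. */
size_t count_counted(const char *str, size_t len, char target)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (str[i] == target);
    return n;
}
```

Whether a given compiler actually vectorizes the second form depends on the compiler version, flags, and target, so it is worth checking the generated assembly.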
The reason "blocking" by accumulating into a uint8_t helps further is that it lets the compiler accumulate in 8-bit SIMD lanes instead of 32-bit SIMD lanes.
A vector register of a given width holds four times as many 8-bit lanes as 32-bit lanes, so the same operation on 8-bit lanes will, to a first approximation, do 4x the work per cycle.
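A rough sketch of what that blocking can look like, again assuming a hypothetical byte-counting task (`count_blocked` and its parameters are made up for illustration). Each inner block runs at most 255 iterations, so the uint8_t accumulator cannot wrap before it is flushed into the wide total.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: count bytes equal to `target` in buf[0..len). */
size_t count_blocked(const unsigned char *buf, size_t len, unsigned char target)
{
    size_t total = 0;
    size_t i = 0;

    while (i < len) {
        /* Cap the block at 255 so the 8-bit accumulator cannot overflow. */
        size_t block = len - i;
        if (block > 255)
            block = 255;

        uint8_t acc = 0; /* narrow accumulator: candidate for 8-bit lanes */
        for (size_t j = 0; j < block; j++)
            acc += (buf[i + j] == target);

        total += acc; /* flush into the wide counter once per block */
        i += block;
    }
    return total;
}
```

This only illustrates the shape of the loop. Whether the inner loop really ends up on 8-bit lanes is up to the compiler and target.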