Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.

The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.

The reason "blocking" by accumulating on uint8_t helps further is that it allows the compiler to accumulate on 8 bit SIMD lanes, instead 32 bit SIMD lanes. The same operation on 8 bit SIMD lanes will, to a first approximation, do x4 the work per cycle.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: