Hacker News

It is there to process the last <32 elements. The vectorized loop processes up to 32 elements per iteration. That iteration does not run if fewer than 32 elements remain, because it needs to load a full 32 bytes of input. This is very typical of vectorized loops: a main loop that processes N elements per iteration, followed by a second loop that handles the tail of <N elements.
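The shape of that pattern can be sketched in plain C (a hypothetical byte-summing routine; the inner fixed-size block stands in for one 32-byte SIMD iteration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example of the main-loop + tail-loop pattern.
   The inner 32-element block stands in for one vectorized iteration. */
uint32_t sum_bytes(const uint8_t *p, size_t n) {
    uint32_t total = 0;
    size_t i = 0;
    /* Main loop: only runs while a full 32-byte block remains. */
    for (; i + 32 <= n; i += 32) {
        for (size_t j = 0; j < 32; j++)  /* stands in for one vector op */
            total += p[i + j];
    }
    /* Tail loop: handles the final n % 32 (< 32) elements one by one. */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

The main loop's condition `i + 32 <= n` is exactly why the tail loop exists: it never starts an iteration that would read past the end of the buffer.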


Though I am curious why he didn't use predicated (masked) instructions for the tail loop. I've switched to that pattern when writing AVX-512.
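The masked-tail idea can be shown in a portable scalar sketch (real AVX-512 code would build a k-mask and use masked loads/stores instead; the 16-lane width, function name, and per-lane loop here are illustrative standins):

```c
#include <stddef.h>
#include <stdint.h>

/* Portable sketch of the masked-tail pattern that AVX-512 k-masks
   provide in hardware. A hypothetical 16-lane increment: the final
   partial block runs through the same body with inactive lanes
   masked off, so no separate scalar tail loop is needed. */
#define LANES 16

void add_one_masked(int32_t *v, size_t n) {
    for (size_t i = 0; i < n; i += LANES) {
        size_t rem = n - i;
        /* Full blocks get an all-ones mask; the tail gets (1 << rem) - 1. */
        uint32_t mask = (rem >= LANES) ? 0xFFFFu : ((1u << rem) - 1u);
        for (size_t j = 0; j < LANES; j++)  /* stands in for one masked op */
            if (mask & (1u << j))
                v[i + j] += 1;
    }
}
```

One loop, no tail: the mask computation replaces the second loop entirely, which is the appeal of predication.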


'Tail handling' in general is an annoying aspect of SIMD. Masks are great, but no panacea--in particular, if you unroll, then you cannot take care of the tail with a single masked instruction. There are various solutions to this. I favour overlapping accesses, where that's feasible (following a great deal of evangelism from Mateusz Guzik); a colleague uses a variant of Duff's device; you can also just generate multiple masks.
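The overlapping-accesses trick can be sketched like this (hypothetical 8-lane op; it requires n >= vector width, and an operation that tolerates reprocessing, since the final block is shifted back to end exactly at n and so revisits up to width-1 elements):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the overlapping-access tail trick. Hypothetical 8-lane
   clamp; the per-block helper stands in for one vector operation.
   Safe here because clamping is idempotent. Requires n >= VEC. */
#define VEC 8

static void clamp_block(uint8_t *p) {      /* idempotent: min(x, 100) */
    for (size_t j = 0; j < VEC; j++)
        if (p[j] > 100) p[j] = 100;
}

void clamp_overlap(uint8_t *p, size_t n) {
    size_t i = 0;
    for (; i + VEC <= n; i += VEC)
        clamp_block(p + i);
    if (i < n)           /* final block overlaps the previous one */
        clamp_block(p + n - VEC);
}
```

No scalar tail loop and no masks: the last "vector" is simply re-aligned so it ends at the buffer's end.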

I would expect the linked code is just intended as a quick proof of concept, so it does not bother to be optimal.


:) For anyone interested, here's a brief discussion of tail handling: https://github.com/google/highway#strip-mining-loops

(Overlapping is indeed cool where it works, i.e. for idempotent operations.)


> idempotent

Roughly speaking, while you do need idempotence for reductions, you do not need it for maps. (Of course, plenty of things don't look quite like reductions or maps--or they look like both at once--and each is unique.)
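The map case can be illustrated concretely (hypothetical 8-lane out-of-place map; the per-element op `x * 2 + 1` is deliberately not idempotent, yet overlapping is still safe because the revisited lanes just recompute and rewrite the same output values):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-lane out-of-place map with an overlapping tail.
   The element op is not idempotent, but since dst does not feed back
   into src, reprocessing the overlap region writes identical values.
   Requires n >= W. */
#define W 8

static void map_block(const int32_t *src, int32_t *dst) {
    for (size_t j = 0; j < W; j++)   /* stands in for one vector op */
        dst[j] = src[j] * 2 + 1;
}

void map_overlap(const int32_t *src, int32_t *dst, size_t n) {
    size_t i = 0;
    for (; i + W <= n; i += W)
        map_block(src + i, dst + i);
    if (i < n)
        map_block(src + n - W, dst + n - W);
}
```

A reduction written the same way would double-count the overlapped elements, which is why reductions need masking (or a subtraction of the overlap) instead.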


> 'Tail handling' in general is an annoying aspect of simd.

Only for "classical" SIMD; ARM SVE2 shows a way to solve this issue cleanly. Not sure where you can use SVE2, though.
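SVE's approach can be sketched in portable C (a simulation only: the predicate computed each iteration mimics what SVE's `whilelt` instruction produces in hardware, so one loop body covers full blocks and the tail alike; the 8-lane width and function name are standins):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Portable sketch of an SVE-style predicated loop. Each iteration
   builds a whilelt-like predicate marking exactly the lanes where
   i + j < n, so there is no separate tail loop at all. Real SVE
   would use svwhilelt_b32 and merging/predicated ops instead. */
#define VL 8   /* hypothetical vector length; SVE's is not fixed */

void add_scaled(int32_t *dst, const int32_t *src, int32_t k, size_t n) {
    for (size_t i = 0; i < n; i += VL) {
        bool pred[VL];                       /* ~ whilelt(i, n) */
        for (size_t j = 0; j < VL; j++)
            pred[j] = (i + j) < n;
        for (size_t j = 0; j < VL; j++)      /* predicated multiply-add */
            if (pred[j])
                dst[i + j] += k * src[i + j];
    }
}
```

Because SVE is also vector-length agnostic, the same binary handles the tail correctly on any hardware vector width, which is the "clean" part.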


Not SVE2, but lemire has a post about using SVE on Amazon Graviton 2 processors: https://lemire.me/blog/2022/07/14/filtering-numbers-faster-w...


Correction, Graviton 3.


Thank you!



