Hacker News

It is there to process the last <32 elements. The vectorized loop processes up to 32 elements per iteration. That iteration does not run if fewer than 32 elements remain, because it needs to load a full 32 bytes of input. This is very typical of vectorized loops: a main loop that processes N elements per iteration, followed by a second loop that handles the tail of <N elements.
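The shape of that pattern can be sketched in plain C (a hypothetical byte-summing routine; the inner fixed-size block stands in for one 32-byte SIMD iteration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example of the main-loop + tail-loop pattern.
   The inner 32-element block stands in for one vectorized iteration. */
uint32_t sum_bytes(const uint8_t *p, size_t n) {
    uint32_t total = 0;
    size_t i = 0;
    /* Main loop: only runs while a full 32-byte block remains. */
    for (; i + 32 <= n; i += 32) {
        for (size_t j = 0; j < 32; j++)  /* stands in for one vector op */
            total += p[i + j];
    }
    /* Tail loop: handles the final n % 32 (< 32) elements one by one. */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

The main loop's condition `i + 32 <= n` is exactly why the tail loop exists: it never starts an iteration that would read past the end of the buffer.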


Though I am curious why he didn't use predicated (masked) instructions for the tail loop. I've switched to that pattern when writing AVX-512.
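The masked-tail idea can be shown in a portable scalar sketch (real AVX-512 code would build a k-mask and use masked loads/stores instead; the 16-lane width, function name, and per-lane loop here are illustrative standins):

```c
#include <stddef.h>
#include <stdint.h>

/* Portable sketch of the masked-tail pattern that AVX-512 k-masks
   provide in hardware. A hypothetical 16-lane increment: the final
   partial block runs through the same body with inactive lanes
   masked off, so no separate scalar tail loop is needed. */
#define LANES 16

void add_one_masked(int32_t *v, size_t n) {
    for (size_t i = 0; i < n; i += LANES) {
        size_t rem = n - i;
        /* Full blocks get an all-ones mask; the tail gets (1 << rem) - 1. */
        uint32_t mask = (rem >= LANES) ? 0xFFFFu : ((1u << rem) - 1u);
        for (size_t j = 0; j < LANES; j++)  /* stands in for one masked op */
            if (mask & (1u << j))
                v[i + j] += 1;
    }
}
```

One loop, no tail: the mask computation replaces the second loop entirely, which is the appeal of predication.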


'Tail handling' in general is an annoying aspect of SIMD. Masks are great, but no panacea--in particular, if you unroll, then you cannot take care of the tail with a single masked instruction. There are various solutions to this. I favour overlapping accesses, where that's feasible (following a great deal of evangelism from Mateusz Guzik); a colleague uses a variant of Duff's device; you can also just generate multiple masks.
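The overlapping-accesses trick can be sketched like this (hypothetical 8-lane op; it requires n >= vector width, and an operation that tolerates reprocessing, since the final block is shifted back to end exactly at n and so revisits up to width-1 elements):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the overlapping-access tail trick. Hypothetical 8-lane
   clamp; the per-block helper stands in for one vector operation.
   Safe here because clamping is idempotent. Requires n >= VEC. */
#define VEC 8

static void clamp_block(uint8_t *p) {      /* idempotent: min(x, 100) */
    for (size_t j = 0; j < VEC; j++)
        if (p[j] > 100) p[j] = 100;
}

void clamp_overlap(uint8_t *p, size_t n) {
    size_t i = 0;
    for (; i + VEC <= n; i += VEC)
        clamp_block(p + i);
    if (i < n)           /* final block overlaps the previous one */
        clamp_block(p + n - VEC);
}
```

No scalar tail loop and no masks: the last "vector" is simply re-aligned so it ends at the buffer's end.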

I would expect the linked code is just intended as a quick proof of concept, so it does not bother to be optimal.


:) For anyone interested, here's a brief discussion of tail handling: https://github.com/google/highway#strip-mining-loops

(Overlapping is indeed cool where it works, i.e. for idempotent operations.)


> idempotent

Roughly speaking, while you do need idempotence for reductions, you do not need it for maps. (Of course, plenty of things don't look quite like reductions or maps--or they look like both at once--and each is unique.)
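The map case can be illustrated concretely (hypothetical 8-lane out-of-place map; the per-element op `x * 2 + 1` is deliberately not idempotent, yet overlapping is still safe because the revisited lanes just recompute and rewrite the same output values):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-lane out-of-place map with an overlapping tail.
   The element op is not idempotent, but since dst does not feed back
   into src, reprocessing the overlap region writes identical values.
   Requires n >= W. */
#define W 8

static void map_block(const int32_t *src, int32_t *dst) {
    for (size_t j = 0; j < W; j++)   /* stands in for one vector op */
        dst[j] = src[j] * 2 + 1;
}

void map_overlap(const int32_t *src, int32_t *dst, size_t n) {
    size_t i = 0;
    for (; i + W <= n; i += W)
        map_block(src + i, dst + i);
    if (i < n)
        map_block(src + n - W, dst + n - W);
}
```

A reduction written the same way would double-count the overlapped elements, which is why reductions need masking (or a subtraction of the overlap) instead.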


> 'Tail handling' in general is an annoying aspect of simd.

Only for "classical" SIMD; ARM SVE2 shows a way to solve this issue cleanly. Not sure where you can use SVE2, though.
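SVE's approach can be sketched in portable C (a simulation only: the predicate computed each iteration mimics what SVE's `whilelt` instruction produces in hardware, so one loop body covers full blocks and the tail alike; the 8-lane width and function name are standins):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Portable sketch of an SVE-style predicated loop. Each iteration
   builds a whilelt-like predicate marking exactly the lanes where
   i + j < n, so there is no separate tail loop at all. Real SVE
   would use svwhilelt_b32 and merging/predicated ops instead. */
#define VL 8   /* hypothetical vector length; SVE's is not fixed */

void add_scaled(int32_t *dst, const int32_t *src, int32_t k, size_t n) {
    for (size_t i = 0; i < n; i += VL) {
        bool pred[VL];                       /* ~ whilelt(i, n) */
        for (size_t j = 0; j < VL; j++)
            pred[j] = (i + j) < n;
        for (size_t j = 0; j < VL; j++)      /* predicated multiply-add */
            if (pred[j])
                dst[i + j] += k * src[i + j];
    }
}
```

Because SVE is also vector-length agnostic, the same binary handles the tail correctly on any hardware vector width, which is the "clean" part.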


Not SVE2, but lemire has a post about using SVE on Amazon Graviton 2 processors: https://lemire.me/blog/2022/07/14/filtering-numbers-faster-w...


Correction, Graviton 3.


Thank you!



