
'Tail handling' in general is an annoying aspect of SIMD. Masks are great, but no panacea--in particular, if you unroll, then you cannot take care of the tail with a single masked instruction. There are various solutions to this. I favour overlapping accesses, where that's feasible (following a great deal of evangelism from Mateusz Guzik); a colleague uses a variant of Duff's device; you can also just generate multiple masks.

I would expect the linked code is just intended as a quick PoC, so it does not bother to be optimal.
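
To make the overlapping-access idea concrete, here is a minimal sketch in portable C. It is not taken from the linked code: the width `W`, the helper `scale_block` (a scalar stand-in for one full-width vector operation), and the doubling operation are all assumptions chosen for illustration. The main loop handles whole blocks, and the tail is covered by one more full-width block that ends exactly at `n` and so overlaps elements already processed.

```c
#include <assert.h>
#include <stddef.h>

enum { W = 8 }; /* hypothetical vector width, in floats */

/* Stand-in for one full-width SIMD operation: dst[0..W) = 2 * src[0..W). */
static void scale_block(float *dst, const float *src) {
    for (size_t j = 0; j < W; j++) dst[j] = 2.0f * src[j];
}

/* out[i] = 2 * in[i] for i in [0, n). out and in must not alias.
 * Requires n >= W so that the overlapping tail block stays in bounds. */
static void scale_overlapping(float *out, const float *in, size_t n) {
    assert(n >= W);
    size_t i = 0;
    for (; i + W <= n; i += W) scale_block(out + i, in + i);
    /* Tail: redo the last W elements as one full block instead of masking.
     * Some outputs are written twice, with the same value both times. */
    if (i < n) scale_block(out + (n - W), in + (n - W));
}
```

With `n = 19` and `W = 8`, the loop covers indices 0 to 15 and the tail block covers 11 to 18, so indices 11 to 15 are computed twice. Inputs shorter than `W` still need a separate path (a mask or a scalar loop), which is one reason overlapping is only an option "where that's feasible".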



:) For anyone interested, here's a brief discussion of tail handling: https://github.com/google/highway#strip-mining-loops

(Overlapping is indeed cool where it works - idempotent operations.)


> idempotent

Roughly speaking, while you do need idempotence for reductions, you do not need it for maps. (Of course, plenty of things don't look quite like reductions or maps--or they look like both at once--and each is unique.)
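
One way to read the distinction, sketched in portable C under assumed names (`W`, `block_max`, `block_sum`, and both reduction functions are hypothetical, with scalar loops standing in for vector operations): an overlapping tail feeds some elements into the reduction twice. That is harmless for max, which is idempotent, and wrong for a sum.

```c
#include <assert.h>
#include <stddef.h>

enum { W = 4 }; /* hypothetical vector width, in ints */

/* Stand-ins for full-width SIMD reductions over one block of W ints. */
static int block_max(const int *p) {
    int m = p[0];
    for (size_t j = 1; j < W; j++) if (p[j] > m) m = p[j];
    return m;
}
static int block_sum(const int *p) {
    int s = 0;
    for (size_t j = 0; j < W; j++) s += p[j];
    return s;
}

/* Correct: seeing an element twice cannot change a maximum. Requires n >= W. */
static int max_overlapping(const int *a, size_t n) {
    assert(n >= W);
    int m = block_max(a);
    size_t i = W;
    for (; i + W <= n; i += W) { int b = block_max(a + i); if (b > m) m = b; }
    if (i < n) { int b = block_max(a + (n - W)); if (b > m) m = b; }
    return m;
}

/* Deliberately naive: the overlapped elements are added twice, so the
 * result is wrong whenever n is not a multiple of W. Requires n >= W. */
static int sum_overlapping_naive(const int *a, size_t n) {
    assert(n >= W);
    int s = 0;
    size_t i = 0;
    for (; i + W <= n; i += W) s += block_sum(a + i);
    if (i < n) s += block_sum(a + (n - W));
    return s;
}
```

For the six elements 1 to 6, the max version returns 6, while the naive sum counts 3 and 4 twice and returns 28 instead of 21. An out-of-place map avoids the problem because each overlapped output is simply recomputed from unchanged input, whatever the operation is.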


> 'Tail handling' in general is an annoying aspect of simd.

Only for "classical" SIMD; ARM SVE2 shows a way to solve this issue cleanly. Not sure where you can use SVE2, though.


Not SVE2, but Lemire has a post about using SVE on Amazon Graviton 2 processors: https://lemire.me/blog/2022/07/14/filtering-numbers-faster-w...


Correction, Graviton 3.



