Yes, I'm aware of the math, but there are other aspects, like 'does this form of the code help auto-vectorisation?', 'does it work across both gcc and clang?', 'what about optimisation levels?', 'does my hand-rolled version beat the compiler's optimisation of the naive code?', etc. It's a lot of fun.