I love me some ISA extensions. I'd love to know what these are intended and useful for, though. 1-bit inference? I hear they could be useful in crypto as well, but that's outside my field.
You may be correct in the sense that, in a public acquisition statement, people should be inferring enormous context and not taking anything said at face value.
It's simultaneously true that this is the farthest thing from effective, honest, and clear communication. Reading between the lines here is required precisely because we all know that any acquisition statement is at best heavily coded, if not complete fluff.
You can recognize that and still get angry that it's par for the course for such things to be not just devoid of useful information, but often actively deceptive.
Tbf, and in support of your broader point, there's no reading between the lines, because genuine intent is indistinguishable from deception with this kind of stuff, because the latter imitates the former. There's only expecting the worst, and being only occasionally wrong.
You'd be surprised, even just browsing this comment section. I guess they continue to do it because it works. Or because they no longer care about public sentiment.
For what it's worth, I had the exact same experience you did when I started writing SIMD code explicitly with intrinsics.
I avoided it for a long time because, well, it was so damn ugly and verbose to do simple things. However, in actual practice it's not nearly as painful as it looks, and you get used to it quickly.
which, honestly, shouldn't be necessary today with AVX512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it _is_ aligned you will get the same performance as the aligned-only load.
No reason for the compiler to balk at vectorizing unaligned data these days.
> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.
A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!
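To put a rough number on "most of those reads will be unaligned", here is a minimal sketch that counts how many loads in a 1D neighborhood start on a 64-byte boundary. The function name `aligned_fraction` and its parameters are made up for illustration, and it assumes the best case where the center load itself is aligned.

```python
def aligned_fraction(radius, elem_size=4, vec_bytes=64):
    """Fraction of the 2*radius+1 neighborhood loads that start on a vec_bytes boundary.

    Assumption: the load at offset 0 is aligned (best case); every other load
    starts k elements away from it.
    """
    offsets = range(-radius, radius + 1)
    aligned = sum(1 for k in offsets if (k * elem_size) % vec_bytes == 0)
    return aligned / len(offsets)


if __name__ == "__main__":
    for radius in (1, 4, 16):
        print(f"radius {radius:2d}: {aligned_fraction(radius):.1%} of loads aligned")
```

With float32 elements only offsets that are multiples of 16 elements land on a boundary, so for small kernels the center load is the only aligned one.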
A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth rather than compute, I have found it worth doing a head/body/tail split, where you do one misaligned iteration before doing the bulk of the work aligned. Honestly, though, for that to be worth it you have to be working almost completely out of L1 cache, which is rare. Otherwise you're going to be slowed down to L2 or memory speed anyway, at which point the half-rate penalty doesn't really matter.
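The head/body/tail arithmetic can be sketched like this. It is only the index math, not the SIMD loop itself; `split_head_body_tail` is a hypothetical helper, and the 64-byte vector width is an assumption matching AVX512.

```python
def split_head_body_tail(addr, n_bytes, vec_bytes=64):
    """Split a buffer into (head, body, tail) byte counts.

    head: bytes before the first vec_bytes-aligned address (one unaligned or
          masked iteration would cover these)
    body: bytes covered by whole, aligned vectors
    tail: leftover bytes after the last whole vector
    """
    head = min((-addr) % vec_bytes, n_bytes)
    body = (n_bytes - head) // vec_bytes * vec_bytes
    tail = n_bytes - head - body
    return head, body, tail


if __name__ == "__main__":
    # Buffer starting 16 bytes past a 64-byte boundary, 1000 bytes long.
    print(split_head_body_tail(0x1010, 1000))
```

In real code the head and tail would typically be handled with an unaligned or masked load rather than a scalar loop, but the split points are the same.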
The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.
Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.
Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!
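A quick sketch of why line crossing matters more as the vectors get wider. It assumes 64-byte cache lines and 4 KiB pages and treats every byte offset as equally likely, which real data (usually at least element-aligned) won't quite match; `crosses` and `crossing_fraction` are names invented here.

```python
def crosses(addr, width, boundary):
    """True if the access [addr, addr + width) straddles a multiple of boundary."""
    return addr // boundary != (addr + width - 1) // boundary


def crossing_fraction(width, boundary=64):
    """Fraction of start offsets within one boundary-sized block that cross it."""
    return sum(crosses(off, width, boundary) for off in range(boundary)) / boundary


if __name__ == "__main__":
    for width in (16, 32, 64):  # SSE, AVX2, AVX512 vector sizes in bytes
        print(f"{width}-byte access: {crossing_fraction(width):.1%} of offsets cross a line")
    # A 64-byte load starting 32 bytes before a page boundary also crosses the page.
    print(crosses(4096 - 32, 64, 4096))
```

Under these assumptions a 16-byte access crosses a line at 15 of 64 offsets, a 32-byte one at 31 of 64, and a 64-byte one at every offset except the aligned one.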
With the older microarchitectures there was a large penalty for crossing a cache line with AVX-512. In some cases, the performance could be worse than AVX2!
In older microarchitectures like Ice Lake it was pretty bad, so you wanted to avoid unaligned loads if you could. This penalty has rapidly shrunk across subsequent generations of microarchitectures. The penalty is still there but on recent microarchitectures it is small enough that the unaligned case often isn't a showstopper.
The main reason to use aligned loads in code is to denote cases where you expect the address to always be aligned i.e. it should blow up if it isn't. Forcing alignment still makes sense if you want predictable, maximum performance but it isn't strictly necessary for good performance on recent hardware in the way it used to be.
AVX doesn't require alignment of any memory operands, with the exception of the explicitly aligned load/store instructions. So you/the compiler are free to use the reg,mem form interchangeably with unaligned data.
The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (an unaligned AVX512 access always crosses a cacheline, since the registers are cacheline sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is only reached when everything is aligned, the blocker is elsewhere for most real code.
I've run into this as well. Problem is that linear RGB is most definitely not a perceptually uniform space, so blending in it frequently does something different than you want. Use linear for physically based light and mixing, but if you are modeling an operation that is based on human perception it is going to be completely wrong.
The dark irony, then, is that sRGB with its gamma curve applied models luminance better (closer to human perception) for blending than linear does. If you can afford to do the blend in a perceptually uniform space like Oklab, even better of course.
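A small sketch of the difference, blending black and white 50/50 in each space and measuring the result with CIE L* (where 50 is the perceptual midpoint). It uses the standard sRGB transfer functions and works on grays only, so the linear value doubles as relative luminance; `blend_gray` is a made-up helper, not anyone's production code.

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded value in 0..1 to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c):
    """Encode a linear-light value in 0..1 with the sRGB curve."""
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def lightness(y):
    """CIE L* (0..100) from relative luminance y in 0..1."""
    return 116 * y ** (1 / 3) - 16 if y > 216 / 24389 else (24389 / 27) * y


def blend_gray(a, b, t, space):
    """Blend two sRGB-encoded grays; return (sRGB-encoded result, its L*)."""
    if space == "linear":
        y = (1 - t) * srgb_to_linear(a) + t * srgb_to_linear(b)
    else:  # blend the gamma-encoded values directly
        y = srgb_to_linear((1 - t) * a + t * b)
    return linear_to_srgb(y), lightness(y)


if __name__ == "__main__":
    for space in ("linear", "srgb"):
        encoded, l_star = blend_gray(0.0, 1.0, 0.5, space)
        print(f"{space:6s}: sRGB value {encoded:.3f}, L* {l_star:.1f}")
```

The linear blend comes out at roughly L* 76 and the sRGB-space blend at roughly L* 53, so for a "halfway between" that is meant to look halfway, the gamma-encoded blend is the closer of the two.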
Yes and no. I think NEON is undersized for today at 128-bit registers: if you're working with doubles, for example, that's only two values per register, which is pretty anemic. Things like shuffles and other tricky bitops benefit from wider widths as well (see my other reply).
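For reference, the lane counts are just register width divided by element width. A trivial sketch (the `lanes` helper is hypothetical):

```python
def lanes(register_bits, element_bits):
    """Number of elements of element_bits that fit in one SIMD register."""
    return register_bits // element_bits


if __name__ == "__main__":
    for reg in (128, 256, 512):
        print(f"{reg}-bit: {lanes(reg, 64)} doubles, "
              f"{lanes(reg, 32)} floats, {lanes(reg, 8)} bytes")
```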
Agreed that 128-bit is undersized, but 512 feels pretty good for the time being. We're unlikely to see further size increases, since going to 1024 would require doubling the cache line, register file, and RAM bandwidth, while just adding an extra FMA port takes far less hardware.
Totally - especially given how bandwidth constrained CPUs still are, going wider than 512 doesn't make much sense. 512 itself was a stretch for quite a long time (and all the negative press on the original implementations was a consequence of it being not-quite-ready for primetime), but for current hardware I think it's perfect.
But 128-bit is just ancient. If you're going to go to significant trouble to rewrite your code in SIMD, you want to at least get a decent perf return on investment!