You have widening operations e.g. 16x16->32 bit multiplications and can reduce n...

jcranmer · on Dec 2, 2021

Given the number of implementations of str* routines in https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/m..., you might want to revisit your last statement. PCMP/MOVMSK work well enough for finding the trailing NUL.
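
For readers who haven't seen the idiom, here is a minimal sketch of a PCMP/MOVMSK NUL search using SSE2 intrinsics. It is not glibc's code: `sketch_strlen` is a hypothetical name, `__builtin_ctz` assumes GCC or Clang, and the aligned load that starts before `s` relies on the usual assumption that an aligned 16-byte read never crosses a page (it is still out of bounds as far as the C abstract machine is concerned). A scalar fallback keeps it compilable on non-x86 targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper for illustration only; not taken from glibc. */
size_t sketch_strlen(const char *s)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();

    /* Round the pointer down to a 16-byte boundary so every load is
       aligned and therefore stays within a single page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned skip = (unsigned)((uintptr_t)s & 15);

    /* PCMPEQB marks NUL bytes, PMOVMSKB packs the result into 16 bits. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));

    /* Discard matches in the bytes that precede the start of the string. */
    mask = (mask >> skip) << skip;

    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }

    /* The lowest set bit is the index of the first NUL in this block. */
    return (size_t)((p + __builtin_ctz(mask)) - s);
#else
    /* Scalar fallback for targets without SSE2. */
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
#endif
}
```

Even this small version needs an alignment prologue and a mask fix-up before the main loop, and a production routine would repeat the pattern for each vector width it supports.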

crest · on Dec 3, 2021

Now compare how many different versions of the functions are required for the dozens of possible x86 extensions (and combinations of them), all the prologue/epilogue code required to watch out for page boundaries and unaligned pointers, and the length of the inner loop needed to handle all the packing/unpacking, cobble together horizontal operations into the required masks, and somehow use them for flow control where needed. It's enough code to put painful pressure on the instruction cache, and it requires wide OoO superscalar CPU cores to be worth the overhead. Compare the code in the RISC-V vector spec with this strcmp https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m... and tell me it's a clean and straightforward implementation using the instruction set as intended and not an ugly hack around its limitations.

jcranmer · on Dec 3, 2021

I'm not going to dispute that x86's approach leads to a lot of duplication for each vector size, but your statement was that the fixed-size vector approach isn't "useful for these common functions," which implies to me that it couldn't be used at all.