Great points! It seems useful to add a list based on yours to our readme. Please...

dzaima · on Dec 3, 2022

"dzaima" is how I prefer to be referred to. That list is largely me going off of memory, though, so it's definitely worth double-checking. And of course, the items are specific to ≤AVX2: e.g. x!=y does exist in AVX-512 (and clang can rewrite movemask(~(a==b)) → ~movemask(a==b), but gcc won't). I can also imagine truncating narrowing becoming faster than saturating narrowing on AVX-512 at some point in the future. Or maybe saturating narrowing isn't even better? For i32→i8, clang emits two xmmword reads, whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, which may (?) be easier on the memory subsystem. But clang decides to rewrite the permd as vextracti128 & vpunpckldq, making its throughput unnecessarily worse.
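
For readers following along, here is a minimal sketch of the movemask(~(a==b)) → ~movemask(a==b) rewrite mentioned above. It is an illustration only, not code from Highway: it uses 128-bit SSE2 intrinsics as a stand-in for the AVX2 case so that it builds on any x86-64 compiler without extra flags, and the function names (`NeMaskInvertVector`, `NeMaskInvertScalar`, `CheckNeMasks`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics, baseline on x86-64

// Bit i is set where byte lane i of a and b differ.
// Inverts the comparison result as a vector, then extracts the mask.
inline uint32_t NeMaskInvertVector(__m128i a, __m128i b) {
  const __m128i eq = _mm_cmpeq_epi8(a, b);
  const __m128i ne = _mm_xor_si128(eq, _mm_set1_epi8(-1));  // vector NOT
  return static_cast<uint32_t>(_mm_movemask_epi8(ne));
}

// Same mask, but the NOT is applied to the scalar mask instead,
// which avoids the extra vector instruction and the all-ones constant.
inline uint32_t NeMaskInvertScalar(__m128i a, __m128i b) {
  const uint32_t eq_bits =
      static_cast<uint32_t>(_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)));
  return ~eq_bits & 0xFFFFu;  // keep only the 16 lane bits
}

// Small self-check: both forms must agree on a sample input.
inline void CheckNeMasks() {
  const __m128i a =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  const __m128i b =
      _mm_setr_epi8(0, 99, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 99);
  assert(NeMaskInvertVector(a, b) == NeMaskInvertScalar(a, b));
}
```

Whether a compiler performs this rewrite on its own depends on the compiler and version, so writing the scalar-inversion form by hand is one way to get the shorter sequence regardless.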

janwas · on Dec 6, 2022

Here's a first draft, comments welcome: https://github.com/google/highway/pull/1078.

janwas · on Dec 4, 2022

Got it :) Yes, I'll also check our x86_128 file for #if; those are some of the potholes in SSE4 which are filled by AVX2 or AVX-512.