OK, it would indeed be nice if we only had to target AVX-512, but that's not the reality I'm in.
On minimizing required emulation - any thoughts as to how? Padding data structures seems to be the biggest win, only mask/gather/scatter where unavoidable, anything else?
Stay within 16-byte lanes for many things (shuffles of ≤16-bit elements, truncation, etc.); use saturating instructions when narrowing types if possible; try to stick to a==b and signed a>b, e.g. by moving negation elsewhere or avoiding unsigned types; switch to a wider element type if many operations in a sequence aren't supported on the narrower one (or, conversely, stay on the narrower type if only a few ops need a wider one). Some of these may be mitigated by sufficiently advanced compilers, but compilers are quite limited at this currently.
Great points! It seems useful to add a list based on yours to our readme.
Please let me know if you'd like us to acknowledge you in the commit message with anything other than the username dzaima.
"dzaima" is how I prefer to be referred to; but that list is largely me going off memory, so it's definitely worth double-checking. And of course, those points are ≤AVX2-specific, i.e. x!=y does exist in AVX-512 (and clang can rewrite movemask(~(a==b)) as ~movemask(a==b), but gcc won't). I can also imagine truncated narrowing at some point in the future becoming faster than saturating narrowing on AVX-512; or maybe the saturating narrow isn't even better now? For i32→i8, clang emits two xmmword reads, whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, thus maybe (?) being easier on the memory subsystem; but clang decides to rewrite the permd as vextracti128 & vpunpckldq, making it unnecessarily worse in throughput.