OK, it would indeed be nice if we only had to target AVX-512, but that's not the reality I'm in.
On minimizing required emulation - any thoughts as to how? Padding data structures seems to be the biggest win, only mask/gather/scatter where unavoidable, anything else?
Stay within 16-byte lanes for many things (shuffles of ≤16-bit elements, truncation, etc.); use saturating instructions when narrowing types if possible; try to stick to a==b and signed a>b, e.g. by moving negation elsewhere or avoiding unsigned types; switch to a wider element type if many operations in a sequence aren't supported on the narrower one (or, conversely, stay on the narrower type if only a few ops need a wider one). Some of these may be mitigated by sufficiently advanced compilers, but compilers are quite limited at this currently.
Great points! It seems useful to add a list based on yours to our readme.
Please let me know if you'd like us to acknowledge you in the commit message with anything other than the username dzaima.
"dzaima" is how I prefer to be referred to; but that list is largely me going off memory, so it's definitely worth double-checking. And of course, those points are ≤AVX2-specific, i.e. x!=y does exist in AVX-512 (and clang can rewrite movemask(~(a==b)) as ~movemask(a==b), but gcc won't). I can also imagine truncated narrowing at some point in the future becoming faster than saturating narrowing on AVX-512; or maybe the saturating narrow isn't even better now? For i32→i8, clang emits two xmmword reads, whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, thus maybe (?) being easier on the memory subsystem; but clang decides to rewrite the permd as vextracti128 & vpunpckldq, making it unnecessarily worse in throughput.