Memcpy and memset are massively parallel operations used on a CPU all the time.
But let's ignore the _easy_ problems. AES-GCM is massively parallel as well: each 128-bit block can be processed independently, so AVX512 AES (VAES) encryption can process 4 blocks in parallel per clock tick.
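To make the "each block can run in parallel" point concrete, here is a minimal sketch of how GCM's counter blocks depend only on the IV and the block index. It assumes the common 96-bit IV case, does no AES at all, and the names `Block`, `gcm_counter_block` and `gcm_counter_batch` are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// Counter block for the i-th 16-byte data block (0-based), 96-bit IV case.
// J0 = IV || 0x00000001 is reserved for the tag, so data starts at counter 2.
inline Block gcm_counter_block(const std::array<std::uint8_t, 12>& iv,
                               std::uint32_t i) {
    Block b{};
    for (std::size_t k = 0; k < 12; ++k) b[k] = iv[k];
    const std::uint32_t ctr = 2u + i;  // wraps mod 2^32, like GCM's inc32
    b[12] = static_cast<std::uint8_t>(ctr >> 24);  // big-endian counter
    b[13] = static_cast<std::uint8_t>(ctr >> 16);
    b[14] = static_cast<std::uint8_t>(ctr >> 8);
    b[15] = static_cast<std::uint8_t>(ctr);
    return b;
}

// Four consecutive counter blocks: the amount of work that fits in one
// 512-bit register. No block depends on the result of another.
inline std::array<Block, 4> gcm_counter_batch(
    const std::array<std::uint8_t, 12>& iv, std::uint32_t first) {
    std::array<Block, 4> out{};
    for (std::uint32_t j = 0; j < 4; ++j)
        out[j] = gcm_counter_block(iv, first + j);
    return out;
}
```

Each of the four blocks would then be encrypted and XORed with its plaintext independently, which is why the work maps onto wide vector registers.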
Icelake and later CPUs have a REP MOVS / REP STOS implementation that is generally optimal for memcpy and memset, so there’s no reason to use AVX512 for those except in very specific cases.
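For reference, a REP MOVS based copy is tiny. This is only a sketch, assuming GCC or Clang inline asm on x86; `copy_rep_movsb` is a made-up name, and other targets fall back to `std::memcpy`.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy n bytes with a single REP MOVSB on x86
// (GCC/Clang inline asm); other targets fall back to std::memcpy.
inline void* copy_rep_movsb(void* dst, const void* src, std::size_t n) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    void* d = dst;
    const void* s = src;
    std::size_t count = n;
    // "D" = (R|E)DI destination, "S" = (R|E)SI source, "c" = (R|E)CX count.
    // The ABI guarantees the direction flag is clear, so the copy runs forward.
    asm volatile("rep movsb"
                 : "+D"(d), "+S"(s), "+c"(count)
                 :
                 : "memory");
#else
    std::memcpy(dst, src, n);
#endif
    return dst;
}
```

Whether this beats a vectorized loop depends on the microarchitecture and the copy size, which is presumably why library memcpy implementations pick a strategy at runtime.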
I know when I use GCC to compile with AVX512 flags, it seems to output memcpy as AVX registers / ZMMs and stuff...
Auto-vectorization usually sucks for most code. But very simple structure-setting / memcpy / memset-like code is ideal for AVX512. It's a pretty common use case (think a C++ vector<SomeClass> where the default constructor sets the 128-byte structure to some defaults).
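Something like the following is the kind of code I mean. It's a sketch: the fields and default values of `SomeClass` are invented, and `make_defaults` is a made-up helper. Whether the compiler actually emits ZMM stores for the fill depends on the compiler version and flags (for example `-O3 -mavx512f`, possibly with `-mprefer-vector-width=512` on GCC), so check the generated assembly rather than assuming.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 128-byte structure; field names and defaults are made up.
struct SomeClass {
    std::uint32_t id = 0;
    std::uint32_t flags = 1;
    float scale = 1.0f;
    std::uint32_t reserved = 0;      // 16 bytes of header so far
    std::uint8_t payload[112] = {};  // zero-filled body
};
static_assert(sizeof(SomeClass) == 128, "expected a 128-byte structure");

// Constructs n elements, each set to the defaults above. The fill is a
// simple repeated 128-byte pattern, i.e. two 64-byte stores per element.
inline std::vector<SomeClass> make_defaults(std::size_t n) {
    return std::vector<SomeClass>(n);
}
```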
AVX512 doesn't by itself imply Icelake+. The relevant feature is FSRM (fast short REP MOVS), which is distinct from AVX512. In particular, Skylake Xeon, Cannon Lake, Cascade Lake, and Cooper Lake all have AVX512 but not FSRM. My expectation is that all future architectures will support it, so I would expect memcpy and memset implementations tuned for Icelake and onwards to take advantage of it.
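The two features are reported by separate CPUID bits, so they can be checked independently. A minimal sketch, assuming GCC or Clang's `<cpuid.h>` on x86; `Leaf7`, `has_avx512f`, `has_fsrm` and `read_leaf7` are made-up names. It only looks at the CPUID bits and does not check that the OS has enabled AVX512 state.

```cpp
#include <cstdint>

// Registers returned by CPUID leaf 7, subleaf 0 (EAX omitted).
struct Leaf7 { std::uint32_t ebx, ecx, edx; };

// AVX512F is reported in EBX bit 16.
constexpr bool has_avx512f(const Leaf7& r) { return (r.ebx >> 16) & 1u; }
// FSRM (fast short REP MOV) is reported in EDX bit 4.
constexpr bool has_fsrm(const Leaf7& r) { return (r.edx >> 4) & 1u; }

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
// Reads leaf 7 on the current CPU; returns false if the leaf is unsupported.
inline bool read_leaf7(Leaf7& out) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    out = Leaf7{ebx, ecx, edx};
    return true;
}
#else
// Non-x86 build: nothing to query.
inline bool read_leaf7(Leaf7&) { return false; }
#endif
```

On a Skylake Xeon this should report AVX512F without FSRM, and on Icelake both.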
Linus is just somehow ignorant of this subject...