Hacker News

I switched from Intel to AMD Zen 4 over this.

Whether or not AVX-512 is the future, Intel's handling of this left a sour taste for me, and I'd rather be future-proof in case it does gain traction, since I use a CPU for many years before building a new system. Intel's big/little cores (with nothing else new of note compared to 5+ years ago) offer nothing that future-proofs my workflows. 16 equally performant cores with the latest instruction sets do.



I very recently upgraded and had considered both Raptor Lake and Zen 4 CPU options, ultimately going with the latter due to, among other considerations, the AVX-512 support.

Future proofing is no doubt a valid consideration, but some of these benefits are already here today. For example, in a recent Linux distro (using glibc?), attach a debugger to just about any process and break on something like memmove or memset in libc, and you can see some AVX-512 code paths will be taken if your CPU supports these instructions.


Have you been programming with the Zen 4? I bought one, and I've been using the AVX-512 intrinsics via C++ and Rust (LLVM/Clang for both), and I've been a little underwhelmed by the performance. For example, using Agner Fog's vector class library, I'm getting about a 20% speedup going from AVX2 to AVX-512. I was hoping for 2x.


That's because Zen 4 executes most AVX-512 instructions by splitting them over two cycles on its 256-bit datapath: Zen 4 AVX-512 is "double pumped".


Not really. The "pure" computations are double pumped, but some of the utility instructions you use alongside them are native AVX-512. There has been a lot of analysis of this, and AFAIK the conclusion is that outside of very artificial benchmarks, most (not all) applications of AVX-512 will never saturate the pipeline enough for the double pumping to matter, due to how speculative execution, instruction pipelining, etc. work in combination with a few relevant instructions being native AVX-512.

For common mixed workloads, the double-pumped implementation can even be the better choice, as it puts fewer constraints on clock speed and on what the CPU can do in parallel internally.

Sure, if you only look at benchmarks focused on usage patterns that are (for most people) unrealistic (as this article also points out many do), your conclusions might be very different.


I think the notion of double pumping applies only to the VMOV operation. Looking at Agner[1], the rest of the instructions have similar ops/latency to their AVX2 counterparts.

[1] https://www.agner.org/optimize/instruction_tables.pdf


What you want to look at in that table is reciprocal throughput, which is almost everywhere doubled for 512-bit wide instructions.


I think you're right, but I've passed the hacker news edit threshold. May my misinformation live on forever.


I'm guessing that 20% is still enough for your Zen 4 to be faster than Raptor Lake running the AVX2 path, while also probably using less power.


No, you want to look at benchmarks of realistic real-world applications.

Pure throughput doesn't matter if, in most realistic use cases, you will never reach it.


I use it via the MKL and https://github.com/vectorclass/

I think they have very efficient pipelines.



