
UH... didn't Intel actually stop including AVX-512 in their newest (Alder Lake) processors? This seems unlikely. See https://www.tomshardware.com/news/intel-nukes-alder-lake-avx...

>"Although AVX-512 was not fuse-disabled on certain early Alder Lake desktop products, Intel plans to fuse off AVX-512 on Alder Lake products going forward." -Intel Spokesperson to Tom's Hardware.



I switched from Intel to AMD Zen 4 over this.

Whether or not AVX-512 is the future, Intel's handling of this left a sour taste for me, and I'd rather be future-proof in case it does gain traction, since I do use a CPU for many years before building a new system. Intel's big/little cores (with nothing else new of note compared to 5+ years ago) offer nothing that future-proofs my workflows. 16 equally performant cores with the latest instruction sets do.


I very recently upgraded and had considered both Raptor Lake and Zen 4 CPU options, ultimately going with the latter due to, among other considerations, the AVX-512 support.

Future proofing is no doubt a valid consideration, but some of these benefits are already here today. For example, on a recent Linux distro (using glibc?), attach a debugger to just about any process and break on something like memmove or memset in libc, and you can see that AVX-512 code paths are taken if your CPU supports those instructions.
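A rough sketch of what that looks like (the exact glibc IFUNC symbol names vary by version, so treat __memmove_avx512_unaligned_erms as an example rather than gospel):

    $ gdb -p <pid>
    (gdb) break __memmove_avx512_unaligned_erms   # one of glibc's AVX-512 memmove variants
    (gdb) continue
    # if the breakpoint hits, that process is taking the AVX-512 path on this CPU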


Have you been programming with the Zen 4? I bought one, and I've been using the avx512 intrinsics via C++ and Rust (LLVM Clang for both), and I've been a little underwhelmed by the performance. Like say using Agner Fog's vector library, I'm getting about a 20% speedup going from avx2 to avx512. I was hoping for 2x.
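For reference, here's roughly the kind of kernel you'd be comparing - a hedged sketch with made-up function names, not what the parent actually benchmarked (compile with -mavx2 -mfma and -mavx512f respectively). Note that a streaming kernel like this is usually memory-bound, which is one reason the 2x rarely shows up:

    #include <immintrin.h>
    #include <cstddef>

    // AVX2 path: 8 floats per iteration (scalar tail omitted for brevity)
    void scale_add_avx2(float* y, const float* x, float a, std::size_t n) {
        __m256 va = _mm256_set1_ps(a);
        for (std::size_t i = 0; i + 8 <= n; i += 8) {
            __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i));
            _mm256_storeu_ps(y + i, vy);
        }
    }

    // AVX-512 path: 16 floats per iteration, with a masked tail instead of a scalar loop
    void scale_add_avx512(float* y, const float* x, float a, std::size_t n) {
        __m512 va = _mm512_set1_ps(a);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vy = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i), _mm512_loadu_ps(y + i));
            _mm512_storeu_ps(y + i, vy);
        }
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);   // mask covering the leftover elements
        __m512 vy = _mm512_fmadd_ps(va, _mm512_maskz_loadu_ps(m, x + i),
                                        _mm512_maskz_loadu_ps(m, y + i));
        _mm512_mask_storeu_ps(y + i, m, vy);
    }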


That's because Zen 4 runs AVX-512 instructions by breaking them up over two cycles. Zen 4's AVX-512 is "double pumped".


Not really. The "pure" computations are double pumped, but some of the utility instructions you use alongside them are native AVX-512. There has been a lot of analysis of this, and AFAIK the conclusion is that outside of very artificial benchmarks, most (not all) applications of AVX-512 will never saturate the pipeline enough for the double pumping to matter (i.e. due to how speculative execution, instruction pipelining, etc. work in combination with a few relevant instructions being native AVX-512).

For common mixed workloads the double-pumped implementation can even be the better choice, as it puts fewer constraints on clock speed and on what the CPU can do in parallel with it internally.

Sure, if you only look at benchmarks that focus on (for most people) unrealistic usage (as this article also pointed out many do), your conclusions might be very different.


I think the notion of double pumping is only in the VMOV operation. Looking at Agner[1], the rest of the instructions have similar Ops/Latency to their avx2 counterparts.

[1] https://www.agner.org/optimize/instruction_tables.pdf


What you want to look at in that table is reciprocal throughput, which is almost everywhere doubled for 512-bit wide instructions.


I think you're right, but I've passed the hacker news edit threshold. May my misinformation live on forever.


I'm guessing that 20% is still enough for your zen4 to be faster than raptor lake running the avx2 path, while also probably using less power.


No, you want to look at benchmarks of realistic real-world applications.

Pure throughput doesn't matter if, in most realistic use cases, you will never reach it.


I use it via the MKL and https://github.com/vectorclass/

I think they have very efficient pipelines.


Yeah but AMD Zen4 has AVX512 enabled by default. And they're not too shabby with the performance either.

AVX512 is the future, but maybe not Intel's future, ironically.


If I need to crunch small amounts of data in a hurry, existing instructions are fine for that.

If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.

I honestly don't understand who/what AVX512 is really for, other than artificial benchmarks that are intentionally engineered to depend on AVX512 far more often than any real-world application would.


> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.

It takes literally 1 to 10 microseconds (1,000 to 10,000 nanoseconds) to talk to a GPU over PCIe.

In those ~40,000 clock cycles (10 microseconds at ~4 GHz), you could have processed 2.5 MB of data with AVX512 instructions *BEFORE* the GPU is even aware that you're talking to it. Then you have to start passing the data to the GPU, the GPU has to process the data, and then it has to send it all back.

All in all, SIMD instructions on CPU-side are worthwhile for anything less than 8MB for sure, maybe less than 16MB or 32MB, depending on various details.
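The back-of-envelope, assuming a ~4 GHz core and (optimistically) one 64-byte AVX-512 operation retired per cycle:

    10 us x 4 GHz = 40,000 cycles
    40,000 cycles x 64 bytes/cycle ≈ 2.5 MB touched before the PCIe round trip even begins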

----------

That's one core. If one core can talk to the other 32 or 128 cores of your computer (see 64-core EPYC dual-socket machines), you can communicate with another core in just 50 nanoseconds or so (~200 clock cycle penalty), and those other cores can be processing AVX512 as well.

So if you're able to use parallel programming on a CPU, it's probably closer to 1GB+ of data before it actually becomes an 'obvious' choice to talk to the GPU, rather than just keeping it on the CPU-only side.

---------

Example of practical use: AES-GCM can be processed using AVX512 in parallel (each AES-GCM block is a parallel instance, so the entire AES-GCM stream is processed in parallel), but no one will actually use GPUs to process this... because AES instructions are single clock tick (or faster!!) on modern CPUs like Intel / AMD Zen.

That's just going to happen whenever you go to a TLS 1.2 or HTTPS instance, which is pretty much all the time? Like, every single byte coming out of Youtube is over HTTPS these days and needs to be decrypted before further processing.


Is it safe to say that the true future is low latency GPUs? :)


The future is blurring the line between what a GPU and a CPU is.


> It takes literally 1 to 10 microseconds (10,000 nanoseconds) to talk to a GPU over PCIe.

The CPU has an overhead of ~10us to enable the AVX512 units.

It also dramatically reduces the clock on other cores.

For more information, see: https://stackoverflow.com/a/56861355

Timing information: https://www.agner.org/optimize/microarchitecture.pdf


Skylake-X is a processor that's 7 years old. Intel's first implementation always kinda sucks, but the newer implementations have no such restrictions.

It's all about AMD Zen4 or Xeon Ice Lake+, which have no clock reduction and no overheads.


From microarchitecture.pdf

On Alder Lake (pg 172).

> The reader is referred to the timings for Tiger Lake and Gracemont.

On Tiger Lake (pg 167):

> Warm-up period for ZMM vector instructions

> The processor puts the upper parts of the 512 bit vector execution units into a low power mode when they are not used.

> Instructions with 512-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 50,000 clock cycles.

I'm not saying you are wrong. I just haven't heard about that.


https://www.mersenneforum.org/showthread.php?p=614191

> Since 512-bit instructions are reusing the same 256-bit hardware, 512-bit does not come with additional thermal issues. There is no artificial throttling like on Intel chips.

At least for Zen4, there are no worries about throttling or anything, really. It's the same AVX hardware, "double pumped" (two 256-bit micro-instructions output per single 512-bit instruction). But you still save significantly on the decoder (i.e. the "other" hyperthread can use the decoder in the core to keep executing its scalar code at full speed, since your hyperthread is barely executing any instructions).


This is why on-die crypto accelerators make so much sense.


Hopefully they're updated for the new post-quantum algorithms.

Which would you rather have: some fixed-function unit shared between all cores (load balancing? what if you're suddenly doing crypto stuff on many cores?), or the general-purpose tools for running any algorithm on any core?


AES isn't really threatened by quantum computing AFAIK.

And most encryption uses "something" to get an AES key and then uses that to decrypt the data.

And that "something" (e.g. RSA/ECC based approaches) is what is threatened by quantum computing. But it's also not overly problematic if that "something" becomes slower to compute as it's done "only" once per-"a bunch of data".

AFAIK the situation is similar for signing: you normally don't sign the data itself but instead sign a hash of it, and I think the hash algorithms mostly in use are not threatened by quantum computing either.


> AES isn't really threatened by quantum computing

NSA: "You are correct!"


hm, which AES? AES-128 is getting a bit tight already for multi-target attacks.

As to quantum, it looks like practical serial or parallel application of Grover's algorithm might still be decades away. But that is with current knowledge, and who knows what other breakthroughs will be made.


Wrt algorithms, it really is an implementation detail of how flexible they make the crypto engines.

The number of crypto units would be SKU specific depending on the workload. A server box doing service mesh would need one per concurrent flow presumably.

The thing that accelerator offload gets you is an additional thermal budget to spend on general-purpose workloads. If you know that you will be doing SERDES and enc/dec, offloading those to an accelerator frees up watts of TDP (thermal design power) to spend elsewhere. This is also why we see big/little architectures; the OOO processors suck up a lot of power. In-order cores are just fine for latency-insensitive workloads.


Most crypto beyond the initial key exchange is symmetric (AES, ChaCha), and those are still resistant to quantum attacks (well, beyond the brute-force search speed-ups). Post-quantum key exchange is fast enough that it doesn't need dedicated units, unless you're in some super constrained environments. But in that case you'll run into other issues too.


> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.

PCIe is a real bottleneck that affects bandwidth and latency. GPU offload only really works if your data is already resident on the GPU, your kernels are fixed, and the amount of data returned is small.

With HBM and V-Cache, we will see main memory bandwidth over 1TB/s for consumer high-perf CPUs in the near future. At those rates, GPUs won't make sense. GPUs are basically ASICs that can take advantage of hundreds of GB/s of memory bandwidth; when the CPU can do scans at that same rate, the necessity of a GPU is greatly reduced.


If you look at what AVX512 is often used for (as the article mentions), it's less about math and more about speeding up deserializers and various other things that do a bunch of operations on a few bytes at a time.

Which does look like a hilarious waste of silicon; it could cut all of the float/division transistors off and still be plenty useful.


Exactly. It's good at providing speed improvements which will never be noticed in the real world.


That's a bold claim which requires extraordinary evidence. Sufficient counterexamples, most already mentioned in this discussion: databases, NN-512, TLS(AES), JPEG XL decoding (1.5x speedup), ...


> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.

Latency is a hell of a thing. Anything over a millisecond is an absolute eternity, and I know the GPU imposes more than that for most practical applications - especially gaming.


What happens when you have a moderate amount of data and you can't wait for GPU latency?


I can't think of a case like that, where the difference between AVX512 and other instruction sets would be humanly perceptible. Those situations usually end up constrained by memory bandwidth, not CPU or cache throughput.

I'm sure those cases exist, but they don't justify such a large chunk of silicon. And they definitely don't justify slowing down the rest of the CPU.


Web Server. Say... this web page.

Inspector says that this web page we're talking on is 48kB in size. That's small enough to fit inside L1 cache. Every single connection to the web server needs its own AES key for encryption's sake. The 48kB of data (such as my posts above, or your posts) is hot in cache, but if 100 visitors to this webpage need to get it, HTTPS needs to happen 100 different times with 100 different keys.

So this 48kB text message (consisting of all the comments of this page) is going to have to be encrypted with different, random AES keys to deliver the messages to you or me. AES operates on 16 bytes at a time, and AES-GCM is a newer algorithm that allows all 48,000+ bytes to be processed in parallel.

AVX-512 AES instructions are ideal for processing this data, are they not? And processing it 4x faster (since 4 instances of AES run in parallel, as AVX512 can work on 64 bytes per tick vs. 16 bytes per AES instance) is a lot better than just doing it 16 bytes at a time with the legacy AES-NI instructions.
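To make the 4x concrete, a minimal sketch of the VAES primitive involved (just the raw round instruction; key schedule, counter handling, and GHASH are omitted - this is not a full AES-GCM implementation):

    #include <immintrin.h>

    // One AES encryption round applied to four independent 16-byte blocks at once.
    // With plain AES-NI (_mm_aesenc_si128) you'd issue this one block at a time.
    // Needs a CPU with VAES + AVX-512 (e.g. Ice Lake+, Zen 4); compile with -mvaes -mavx512f.
    __m512i aes_round_x4(__m512i four_blocks, __m512i four_round_keys) {
        return _mm512_aesenc_epi128(four_blocks, four_round_keys);
    }

A real AES-128 encryption runs ~10 such rounds per block, but in CTR/GCM mode the blocks are independent, which is exactly why it vectorizes this well.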

----------

Despite being a parallel problem, this will never be worthwhile to send to the GPU. First, GPUs don't have AES-NI instructions. But even if they did, it would take longer to talk to the GPU than for the AVX512 AES instructions to operate on the 48kB of data (again: ~40,000 clock ticks just to start talking with the GPU in practice). In that amount of time, you would have finished encrypting the payload and sent it off.


I've seen a lot about AVX-512 and didn't know those instructions existed until just now. They're not exactly generic vector instructions. And that's a nice improvement, but is AES-NI ever slow enough to matter? The numbers I found were inconsistent but all very fast.

Probably more important, there's a 256 bit version of that instruction. You can get half of that extreme throughput without AVX-512.


That Netflix guy who keeps optimizing their servers keeps coming back every year or so, talking about the latest optimizations he added.

http://nabstreamingsummit.com/wp-content/uploads/2022/05/202...

And a surprising amount of it was in TLS optimizations, in particular, offloading TLS to the hardware (Apparently Mellanox ConnectX ethernet adapters can do AES offload now, so the CPU doesn't have to worry about it).

Since Mellanox ConnectX adapters are still trying to solve the AES problem, I have to imagine that it's a significant portion of a lot of servers' workloads. Intel / AMD are obviously interested in it enough to upgrade AES to 4x wide in the AVX512 instruction set.

I can't say it's particularly useful in any of _my_ workloads. But it seems to come up enough in those hyper-optimized web servers / presentations.


Something that comes to mind is real-time controls, like for high speed manufacturing, rockets and jets, medical robots, etc. These computations are often highly vectorized and are extremely latency-sensitive, for obvious reasons.


Wasn't the issue that efficiency cores don't support AVX-512 and that operating system schedulers/software don't deal with this yet and end up running AVX-512 code on the efficiency cores?


That's a terrible CPU design. You might as well ship arm cores if you are going to have a mismatch of instruction set support on efficiency vs power cores.

Especially since apps using AVX-512 will likely sniff for it at runtime. So now you have a thread that will break on you if it's rescheduled onto an efficiency core. So now what, does the app dev need to start sending "capability requests" when it makes new threads to ensure the OS knows not to put its stuff on an efficiency core?

What a dumb design decision.
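For context, the runtime sniffing usually looks something like this (GCC/Clang builtin shown); the problem is that the check happens once, with no idea which core the thread will land on later:

    #include <cstdio>

    int main() {
        // __builtin_cpu_supports reads CPUID once. On a hybrid chip this tells you
        // nothing about whether the core you're scheduled on later (P or E) has AVX-512.
        if (__builtin_cpu_supports("avx512f"))
            std::puts("dispatching to the AVX-512 code path");
        else
            std::puts("falling back to AVX2/scalar");
    }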


> it will break on you

Not necessarily, if the illegal-instruction interrupt maps it onto an emulation routine — or transfers the thread back to a performance core.


At that point, why even bother?


It could be done dynamically by the scheduler: whenever a thread tries to use AVX-512 on the efficiency core, move it to the power core and keep it there for a certain amount of time. If I am not mistaken, the CPU also exposes instruction counters, which would allow the OS to determine whether a thread has tried using AVX during its last time slice.

In our modern multithreading world, many applications already have separate idle and worker threads. I would not be surprised if such an approach could be implemented with negligible performance drawbacks.


It could also be done by the application. At least on Windows, you can provide the OS with a bitmask of which (virtual) processors you want it to be scheduled on.

So the application could detect which cores have AVX-512 and change its scheduling bitmask before doing the AVX-512 work.

The OS probably should do the dynamic stuff you mentioned; this would then be a way to avoid the initial hit for applications that care about that.
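A rough sketch of that on Windows - SetThreadAffinityMask is the real API, but the mask value here is a made-up example that assumes the P-core hardware threads are logical processors 0-15 (in practice you'd derive it from GetLogicalProcessorInformationEx):

    #include <windows.h>

    void run_avx512_work_on_p_cores() {
        // Hypothetical mask: assumes logical processors 0-15 are the AVX-512-capable P-cores.
        DWORD_PTR p_core_mask = 0xFFFF;
        SetThreadAffinityMask(GetCurrentThread(), p_core_mask);
        // ... do the AVX-512 work on this thread ...
    }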


I initially thought the same, but then realized the big issue with that. The x86 architecture, as it is (see below), requires that both core types appear to be homogenous. Therefore, if the E cores claim support for AVX-512 with CPUID (which would be a lie), then every application using glibc will try to use AVX-512 for memcpy (or whatever) when they shouldn't. As a result, they'd end up pinned to a P core when they should remain on an E core.

This whole mess is because AVX-512 was initially released on hardware where this distinction didn't exist. If AVX-512 was released during or after the whole P/E core thing last year, it would be possible to have applications using AVX-512 state their intentions to the scheduler (as @magicalhippo suggests). The application could say, "hey, OS, I need AVX-512 right now," and all would be well. As it is now, we're stuck with it.


That's why it was only enabled if you disabled the efficiency cores - until Intel fused it off.

And if Intel had wanted this, they could have had the kernel handle the fault and pin the process to the big cores.


This wouldn't work, at least on Linux, because libraries like glibc will put AVX512 usage into every process.


That was the exact same story I read, probably on this site.


On the other hand, you can now virtually guarantee that a GPGPU is present on Intel consumer chips. So now you can write your code 3 times: no vector acceleration (for the really old stuff), AVX-512 for the servers, and GPU for the consumer chips!


Or we can finally start looking into languages with implicit fine-grained parallelism and run on whatever.


We have had that for 15+ years, even in C, with fine-grained looping like OpenMP. It's terribly inefficient to communicate across threads. Sometimes you just need SIMD.


OMP is supposed to support SIMD/GPU offloading these days. Not sure how good it is though.
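For what it's worth, the directives look like this - a minimal sketch; how well a given compiler actually vectorizes the first loop or offloads the second is another question:

    // Build with -fopenmp; the target variant additionally needs an offload-capable toolchain.
    void saxpy_simd(int n, float a, const float* x, float* y) {
        #pragma omp simd                    // ask for SIMD vectorization on the CPU
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_offload(int n, float a, const float* x, float* y) {
        // same loop, offloaded to a device (GPU) if one is available
        #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }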


Oh, sorry. I grew up on the Connection Machine. In that era we did have implicit SIMD languages, and I thought they were great.


I support that effort. In the meantime, we have to do what we have to do. I'm presently in the process of optimizing some stuff and comparing the improvements of SIMD vs GPU on both ARM and x86 (this isn't that impressive, it's some basic loop vectorization). But just as Linus writes in that post, getting the compiler to do well seems impossible.

I'm measuring performance improvement and energy consumption reduction. The results are incredible. We really have to do this stuff. But it's complicated and the documentation is generally awful. So yes, a new language that deals with all of this would be very, very welcome :)


Star Lisp!

Although any functional or HPC language like Chapel will do.


... like intel ISPC.

Spoiler alert: it is a C dialect.


Or HLSL. Can use ISPC as an intermediary language.


This won't help; most uses of AVX or NEON have nothing to do with parallel processing of a bunch of numbers.


How will these languages emit machine code for every variant of GPU/CPU/DSP/Vector/FPGA/whatever architecture they might run on? This isn't as simple as it sounds.


The host binary will include intermediate representation of the compute code to be compiled by the device driver. This already exists, see SYCL + SPIR-V. Intel's oneAPI is one implementation of this approach.
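For example, a minimal SYCL sketch - the kernel below is carried in the binary as intermediate representation (e.g. SPIR-V) and compiled by whichever device driver is present at run time:

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;                                   // picks a GPU, CPU, or other device
        const size_t n = 1024;
        float* data = sycl::malloc_shared<float>(n, q);  // memory visible to host and device
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            size_t idx = i[0];
            data[idx] = static_cast<float>(idx) * 2.0f;  // runs on whatever device q selected
        }).wait();
        sycl::free(data, q);
        return 0;
    }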


No, these are really big and interesting back ends. Still, socially, that has to be better than _everyone_ breaking out the datasheets... if they even have those anymore.


AVX-512 in general purpose CPUs was designed to not really make sense but start seeding the market at 10nm, vaguely make sense at 7nm, and really make sense starting at 5nm (or Intel's 14nm (Intel10) -> 10nm (Intel7) -> 7nm (Intel4)).

So Intel's process woes have been hurting their ability to execute in a meaningful way. Additionally, Alder Lake was hurt by needing to pivot to heterogeneous cores (E cores and P cores), which undermined their ability to keep pushing this even in the 'maybe it makes sense, maybe it doesn't' state it had been in.


AVX-512 is in Alder Lake P-cores but not E-cores. AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogenous ISA extensions. AVX-512 could be enabled by disabling E-cores.

It was supposedly maybe actually taken out of Raptor Lake, but the latest info I can find on that is from July: long before it was released. I have a Raptor Lake CPU but haven't found the time to experiment with disabling E-cores (far too busy overclocking memory and making sensible rainmeter skins).


Early Alder Lake could, but they fused it off in the later ones, and newer microcode also blocks it on the older processors.

Raptor Lake is basically "v-cache alder lake" (the major change is almost doubling cache size) so it's unsurprising they still don't have an answer there, and if they did it could be backported into Alder Lake, but they don't seem to have an immediate fix for AVX-512 in either generation.

Nobody really knows why, or what's up with Intel's AVX-512 strategy in general. I have not heard a fully conclusive/convincing answer in general.

The most convincing idea to me is Intel didn't want to deal with heterogeneous ISA between cores, maybe they are worried about the long-term tail of support that a mixed generation will entail.

Long term I think they will move to heterogeneous cores/architectures but with homogeneous ISA. The little cores will have it too, emulated if necessary, and that will solve all the CPUID-style "how many cores should I launch and what type" problems. That still permits big.little architecture mixtures, but fixes the wonky stuff with having different ISAs between cores.

There are probably some additional constraints like Samsung or whoever bumped into with their heterogeneous-ISA architectures... like cache line size probably should not vary between big/little cores either. That will pin a few architectural features together, if they (or their impacts) are observable outside the "black box".


Well, Intel seems to sorta be proving a point I've had for a long time about ARM's big.LITTLE setup, which is that it's only needed because their big cores weren't sufficiently advanced to scale their power utilization efficiently. If you look at the Intel power/perf curves, the "efficient" cores are anything but. Lots of people have noticed this and pointed out it's probably not "power efficient" but rather "die space efficient under constrained power", because they have fallen behind in the density race and their big cores are quite large.

But I'm not even so sure about that; AVX-512 is probably part of the size problem with the cores. We shall see. You're probably right that heterogeneous might be here to stay, but I suspect a better use of the space long term is even more capable GPU cores offloading the work that might be done on the little cores in the machine. AKA, you get a number of huge/fast/hot cores for all the general-purpose "CPU" workloads, and then offload everything that is trivially parallelized to a GPU that is more closely bound to the cores and shares cache/interconnect.


It's funny because this is sort of addressing a point I made elsewhere in the comments.

https://news.ycombinator.com/item?id=33778016

Like, serious/honest question: how do you see Gracemont as space-efficient here? It's half the size of a full Zen3 core yet probably at best produces the same perf-per-area, and uses 1.5x the transistors of Blizzard for similar performance (almost 3x the size, bearing in mind 5nm vs 7nm). That's not really super small; it's just that Intel's P-cores are truly massive, like wow that is a big core even before the cache comes in.

For years I thought it would be cool to see an all-out "what if we forget area and just build a really fast wide core" and that's what Intel did. And actually, for as much as people say Apple is using a huge "spare no expenses" core, it's not really all that big even considering the area - you get around 1.5-1.6x area scaling between 5nm and 7nm as demonstrated by both Apple cores and NVIDIA GPUs, and probably close to AMD's numbers as well. So just looking at it at a transistor level, Apple is using 2.55 x 1.6 = 4.08mm2 equivalent of silicon and Intel is using 5.55mm2, so Apple is only using 75% of the transistors of Intel's golden cove p-core...

But in the e-core space, it's become a meme that Gracemont is "size efficient rather than power efficient" and I'm just not sure what that means in practical terms. Usually high-density libraries are low-power, so those two things normally go together... and it's certainly not like they're achieving unprecedented perf-per-area, they're probably no better than Zen3 in that respect. Where is the space efficiency in this situation if it's not libraries or topline perf-per-area?


Raptor Lake is not basically "v-cache Alder Lake".

And Intel is still deciding on the future of AVX512; internally there is already a replacement that works with Atom cores (which are size- and power-bound).


> AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogenous ISA extensions

It was already opt-in (disabled unless you also disable efficiency cores), so that is no justification for making it impossible to use for people who want to try it out.

But I suppose Intel just doesn't want people to write software using those new instructions.


If Intel allows enabling AVX-512, they need to validate that functionality on every chip. Some chips may get dropped (or reused as i3s) due to this. There's not much reason to do so for AVX-512 when only enthusiasts would enable it.


For some reason this doesn't hold on Raptor Lake, the successor to Alder Lake. AVX-512 works if you disable all the efficiency cores.



