
> Multimedia encoding/decoding,

Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.

> emulation,

Only current practical use of AVX-512.

> ML, matrix multiplications, ...

Much better to do on the GPU.

In general, wide SIMD on the CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway, and you can gain much more performance if you do that. The best niche for AVX-512 would have been as the baseline common target for things that also get optimized for GPUs... except that Intel has eliminated this possibility by heavy product segmentation right from the start.
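
A minimal sketch of what that massaging looks like (hypothetical example, not from any particular codebase): once the data is laid out as flat structure-of-arrays buffers and the inner loop is branch-free, the loop body is essentially a GPU kernel already.

    // Hypothetical sketch: the restructuring needed for AVX-512 (flat SoA
    // buffers, branch-free inner loop) is the same restructuring a GPU port
    // would need. Assumes an AVX-512F CPU and n being a multiple of 16.
    #include <immintrin.h>
    #include <cstddef>

    struct Particles {       // structure-of-arrays layout
        float* x;            // positions
        float* v;            // velocities
        std::size_t n;
    };

    void integrate_avx512(Particles& p, float dt) {
        const __m512 vdt = _mm512_set1_ps(dt);
        for (std::size_t i = 0; i < p.n; i += 16) {
            __m512 x = _mm512_loadu_ps(p.x + i);
            __m512 v = _mm512_loadu_ps(p.v + i);
            x = _mm512_fmadd_ps(v, vdt, x);   // x += v * dt, 16 lanes at once
            _mm512_storeu_ps(p.x + i, x);
        }
    }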



> Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.

That locks users into only the codecs that have been implemented in that particular hardware, and it increases hardware size for vendor-specific implementations rather than providing common building blocks that many codecs share (DCT, ...).

> Much better to do on the GPU.

Yep, and that's what I run ML stuff on. Not all systems have a GPU available, though, and for some applications it's faster to do the work immediately on the CPU than to pay the overhead of going to/from the GPU.

> In general, wide SIMD on the CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway,

Better tooling and libraries could help with this, imho. Note that the situation for GPUs isn't great here either, since the good tools are locked to a single vendor.


> Much better to do on the GPU

Oh? Which GPU? iGPU? Discrete? Intel? Nvidia? AMD? From which generation? Using what libraries? Assuming you are running on x86-64? Or maybe ARM? Something else? How are you going to handle underflow/overflow? Do you need IEEE FP support? How many $1000s were you going to spend on hardware to test/verify your code on different GPUs?

It also depends on how big the matrices are: too small and the latency of the GPU isn't worth it, too large and they won't fit. The small/large cutoffs depend on which GPU, how many PCIe lanes, and which library.
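
As a rough back-of-envelope sketch (all constants below are assumptions for illustration, not measurements): compare the PCIe transfer plus launch cost against the GPU compute time for a dense fp32 matmul and see where they cross.

    // Rough crossover estimate for offloading an NxN fp32 GEMM.
    // All constants are assumptions for illustration, not measurements.
    #include <cstdio>

    int main() {
        const double pcie_bytes_per_s = 25e9;   // assumed effective PCIe x16 bandwidth
        const double gpu_flops_per_s  = 20e12;  // assumed fp32 throughput
        const double launch_s         = 10e-6;  // assumed launch/driver latency

        for (int n = 256; n <= 8192; n *= 2) {
            double bytes  = 3.0 * n * n * 4;          // A and B over, C back
            double flops  = 2.0 * n * n * double(n);  // dense matmul
            double xfer_s = bytes / pcie_bytes_per_s + launch_s;
            double comp_s = flops / gpu_flops_per_s;
            std::printf("N=%5d  transfer %9.1f us  compute %9.1f us\n",
                        n, xfer_s * 1e6, comp_s * 1e6);
        }
        return 0;
    }

With those made-up numbers the transfer dominates until N is in the thousands, which is exactly the "too small for the GPU" regime.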

So not such a simple decision.


Wasm vector instructions will provide an abstract target that maps to whatever hardware is available, though?
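
For illustration (hypothetical function, assuming clang/Emscripten with -msimd128): the Wasm SIMD that has shipped so far is fixed 128-bit and the engine lowers it to SSE/NEON on the host, so it is an abstract target, just a narrow one; hardware-width-agnostic vectors are still at the proposal stage as far as I know.

    // Sketch: portable 128-bit Wasm SIMD via the clang/Emscripten intrinsics
    // header; the engine maps these v128 ops to whatever the host CPU has.
    // Function name and signature are made up for illustration.
    #include <wasm_simd128.h>
    #include <cstddef>

    void scale_add(float* dst, const float* a, const float* b, float s, std::size_t n) {
        v128_t vs = wasm_f32x4_splat(s);
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            v128_t va = wasm_v128_load(a + i);
            v128_t vb = wasm_v128_load(b + i);
            wasm_v128_store(dst + i, wasm_f32x4_add(wasm_f32x4_mul(va, vs), vb));
        }
        for (; i < n; ++i) dst[i] = a[i] * s + b[i];   // scalar tail
    }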


> you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway

Depends on your use case, I guess. Having AVX-512 on the CPU allows you to re-implement just one critical function to be faster and leave the rest of your code clean and simple. Communicating with a GPU comes with a large latency penalty that is not acceptable in some use cases.
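
For example (hypothetical hot function, assuming an AVX-512F-capable CPU and -mavx512f): only the inner routine changes, the callers stay ordinary scalar C++.

    // Sketch: rewrite just the hot function with AVX-512, keep everything
    // else scalar. Assumes AVX-512F; function is made up for illustration.
    #include <immintrin.h>
    #include <cstddef>

    float dot(const float* a, const float* b, std::size_t n) {
        __m512 acc = _mm512_setzero_ps();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)
            acc = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                                  _mm512_loadu_ps(b + i), acc);
        float sum = _mm512_reduce_add_ps(acc);     // horizontal sum of 16 lanes
        for (; i < n; ++i) sum += a[i] * b[i];     // scalar tail
        return sum;
    }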


Never mind the driver, hardware, etc. dependencies and the associated deployment nightmare.


> you are ~95% of the way to just running it on the GPU anyway

Vector code with lots of branches absolutely exists. You can run it on a GPU, but because GPUs don't dedicate transistors to out-of-order execution, branch prediction, and good prefetchers, the code won't run very well.
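
A toy illustration of the mechanism (hypothetical function, assuming AVX-512F): mask registers let you express a per-element branch as predication, and real branchy code chains many such masks while the CPU's branch predictor and prefetchers handle the surrounding control flow.

    // Sketch: per-lane "if" via an AVX-512 mask register. Remainder handling
    // omitted; function is made up for illustration.
    #include <immintrin.h>
    #include <cstddef>

    void clamp_negatives(float* data, std::size_t n) {
        const __m512 zero = _mm512_setzero_ps();
        for (std::size_t i = 0; i + 16 <= n; i += 16) {
            __m512 v = _mm512_loadu_ps(data + i);
            __mmask16 neg = _mm512_cmp_ps_mask(v, zero, _CMP_LT_OQ);  // lanes where v < 0
            v = _mm512_mask_mov_ps(v, neg, zero);                     // zero only those lanes
            _mm512_storeu_ps(data + i, v);
        }
    }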


That 5% is arguably pretty big.



