
> Multimedia encoding/decoding,

Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.

> emulation,

Only current practical use of AVX-512.

> ML, matrix multiplications, ...

Much better to do on the GPU.

In general, wide SIMD on the CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway, and you can gain much more performance if you do that. The best niche for AVX-512 would have been as the baseline common target for things that also get optimized for GPUs... except that Intel has eliminated this possibility by heavy product segmentation right from the start.
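
A minimal sketch of what that massaging looks like (hypothetical example, not from any particular codebase): once the data is laid out as flat structure-of-arrays buffers and the inner loop is branch-free, the loop body is essentially a GPU kernel already.

    // Hypothetical sketch: the restructuring needed for AVX-512 (flat SoA
    // buffers, branch-free inner loop) is the same restructuring a GPU port
    // would need. Assumes an AVX-512F CPU and n being a multiple of 16.
    #include <immintrin.h>
    #include <cstddef>

    struct Particles {       // structure-of-arrays layout
        float* x;            // positions
        float* v;            // velocities
        std::size_t n;
    };

    void integrate_avx512(Particles& p, float dt) {
        const __m512 vdt = _mm512_set1_ps(dt);
        for (std::size_t i = 0; i < p.n; i += 16) {
            __m512 x = _mm512_loadu_ps(p.x + i);
            __m512 v = _mm512_loadu_ps(p.v + i);
            x = _mm512_fmadd_ps(v, vdt, x);   // x += v * dt, 16 lanes at once
            _mm512_storeu_ps(p.x + i, x);
        }
    }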



> Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.

That locks users into only the codecs that have been implemented in that particular hardware, and it increases hardware size for vendor-specific implementations rather than providing common building blocks that many codecs share (DCT, ...).

> Much better to do on the GPU.

Yep, and that's what I run ML stuff on. Not all systems have a GPU available, though, and for some applications it's faster to do the work immediately on the CPU than to pay the overhead of going to/from the GPU.

> In general, wide SIMD on the CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway,

Better tooling and libraries could help with this, imho. Note that the situation for GPUs isn't great here either, since the good tools are locked to a single vendor.


> Much better to do on the GPU

Oh? Which GPU? iGPU? Discrete? Intel? Nvidia? AMD? From which generation? Using what libraries? Assuming you are running on x86-64? Or maybe ARM? Something else? How are you going to handle underflow/overflow? Do you need IEEE FP support? How many $1000s were you going to spend on hardware to test/verify your code on different GPUs?

It also depends on how big the matrices are: too small and the latency of the GPU isn't worth it, too large and they won't fit. The small/large cutoffs depend on which GPU, how many PCIe lanes, and which library.
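
As a rough back-of-envelope sketch (all constants below are assumptions for illustration, not measurements): compare the PCIe transfer plus launch cost against the GPU compute time for a dense fp32 matmul and see where they cross.

    // Rough crossover estimate for offloading an NxN fp32 GEMM.
    // All constants are assumptions for illustration, not measurements.
    #include <cstdio>

    int main() {
        const double pcie_bytes_per_s = 25e9;   // assumed effective PCIe x16 bandwidth
        const double gpu_flops_per_s  = 20e12;  // assumed fp32 throughput
        const double launch_s         = 10e-6;  // assumed launch/driver latency

        for (int n = 256; n <= 8192; n *= 2) {
            double bytes  = 3.0 * n * n * 4;          // A and B over, C back
            double flops  = 2.0 * n * n * double(n);  // dense matmul
            double xfer_s = bytes / pcie_bytes_per_s + launch_s;
            double comp_s = flops / gpu_flops_per_s;
            std::printf("N=%5d  transfer %9.1f us  compute %9.1f us\n",
                        n, xfer_s * 1e6, comp_s * 1e6);
        }
        return 0;
    }

With those made-up numbers the transfer dominates until N is in the thousands, which is exactly the "too small for the GPU" regime.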

So not such a simple decision.


Wasm vector instructions will provide an abstract target that maps to whatever hardware is available, though?
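
For illustration (hypothetical function, assuming clang/Emscripten with -msimd128): the Wasm SIMD that has shipped so far is fixed 128-bit and the engine lowers it to SSE/NEON on the host, so it is an abstract target, just a narrow one; hardware-width-agnostic vectors are still at the proposal stage as far as I know.

    // Sketch: portable 128-bit Wasm SIMD via the clang/Emscripten intrinsics
    // header; the engine maps these v128 ops to whatever the host CPU has.
    // Function name and signature are made up for illustration.
    #include <wasm_simd128.h>
    #include <cstddef>

    void scale_add(float* dst, const float* a, const float* b, float s, std::size_t n) {
        v128_t vs = wasm_f32x4_splat(s);
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            v128_t va = wasm_v128_load(a + i);
            v128_t vb = wasm_v128_load(b + i);
            wasm_v128_store(dst + i, wasm_f32x4_add(wasm_f32x4_mul(va, vs), vb));
        }
        for (; i < n; ++i) dst[i] = a[i] * s + b[i];   // scalar tail
    }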


> you have to massage your code enough that you are ~95% of the way to just running it on the GPU anyway

Depends on your use case, I guess. Having AVX-512 on the CPU allows you to re-implement just one critical function to be faster and leave the rest of your code clean and simple. Communicating with a GPU comes with a large latency penalty that is not acceptable in some use cases.
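
For example (hypothetical hot function, assuming an AVX-512F-capable CPU and -mavx512f): only the inner routine changes, the callers stay ordinary scalar C++.

    // Sketch: rewrite just the hot function with AVX-512, keep everything
    // else scalar. Assumes AVX-512F; function is made up for illustration.
    #include <immintrin.h>
    #include <cstddef>

    float dot(const float* a, const float* b, std::size_t n) {
        __m512 acc = _mm512_setzero_ps();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)
            acc = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                                  _mm512_loadu_ps(b + i), acc);
        float sum = _mm512_reduce_add_ps(acc);     // horizontal sum of 16 lanes
        for (; i < n; ++i) sum += a[i] * b[i];     // scalar tail
        return sum;
    }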


Never mind the driver, hardware, etc. dependencies and the associated deployment nightmare.


> you are ~95% of the way to just running it on the GPU anyway

Vector code with lots of branches absolutely exists. You can run it on a GPU, but because GPUs don't dedicate transistors to out-of-order execution, branch prediction, and good prefetchers, the code won't run very well.
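
A toy illustration of the mechanism (hypothetical function, assuming AVX-512F): mask registers let you express a per-element branch as predication, and real branchy code chains many such masks while the CPU's branch predictor and prefetchers handle the surrounding control flow.

    // Sketch: per-lane "if" via an AVX-512 mask register. Remainder handling
    // omitted; function is made up for illustration.
    #include <immintrin.h>
    #include <cstddef>

    void clamp_negatives(float* data, std::size_t n) {
        const __m512 zero = _mm512_setzero_ps();
        for (std::size_t i = 0; i + 16 <= n; i += 16) {
            __m512 v = _mm512_loadu_ps(data + i);
            __mmask16 neg = _mm512_cmp_ps_mask(v, zero, _CMP_LT_OQ);  // lanes where v < 0
            v = _mm512_mask_mov_ps(v, neg, zero);                     // zero only those lanes
            _mm512_storeu_ps(data + i, v);
        }
    }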


That 5% is arguably pretty big.



