Are there any libraries that allow me to write different versions of the same function (AVX-512, AVX2, SSE, etc) and then automatically choose the best one that the system supports at runtime? Or maybe even better, the compiler generates multiple versions for me.
In other words, one binary that takes advantage of new instructions but still runs on older hardware. It doesn't even have to be older hardware; plenty of brand-new CPUs don't support AVX-512.
Generally speaking, the vendor libraries have dynamic dispatch support, which identifies which instruction sets are available on a CPU and then dispatches to the best implementation at runtime. Intel got into some hot water for having their dispatch hurt performance on AMD CPUs, but it seems that's been fixed. Their IPP-Crypto and AMD's AOCL-Cryptography libraries both support dynamic dispatch these days, for example.
It seems to me that a table of function pointers is all that's required. Highway is a little fancier in that the first entry is a trampoline that first detects CPU capabilities and then calls your pointer; subsequent calls go straight to the appropriate function.
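To make the pointer-plus-trampoline idea concrete, here is a minimal sketch in C. It is not Highway's actual code; the names (sum_bytes, sum_ptr, sum_trampoline) are hypothetical, and it assumes GCC or Clang on x86-64 for the target attribute and __builtin_cpu_supports, falling back to the scalar version elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature: sum n bytes. */
typedef uint64_t (*sum_fn)(const uint8_t *data, size_t n);

static uint64_t sum_scalar(const uint8_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_AVX2_PATH 1
/* Same loop, but the compiler may emit AVX2 code inside this function only. */
__attribute__((target("avx2")))
static uint64_t sum_avx2(const uint8_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}
#endif

static uint64_t sum_trampoline(const uint8_t *data, size_t n);

/* A one-slot "table". It starts out pointing at the trampoline. */
static sum_fn sum_ptr = sum_trampoline;

/* Detect CPU capabilities and return the best available implementation. */
static sum_fn choose_sum(void) {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}

/* First call lands here: patch the slot, then forward the call.
 * Not thread-safe as written; real code would use an atomic store. */
static uint64_t sum_trampoline(const uint8_t *data, size_t n) {
    sum_ptr = choose_sum();
    return sum_ptr(data, n);
}

/* Public entry point: always one indirect call through the slot. */
uint64_t sum_bytes(const uint8_t *data, size_t n) {
    return sum_ptr(data, n);
}
```

After the first call to sum_bytes, sum_ptr no longer points at the trampoline, so later calls skip detection and go straight to the selected function.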
Do the (experimental/non-portable) compiler versions contribute any additional value?
I gather from the linked-to video that binary-load-time selection has better run-time performance than init-at-first-call run-time dispatch, and doesn't have the tradeoff between performance and security.
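For reference, the compiler-generated variant can look roughly like the sketch below, using GCC's target_clones attribute. The MULTIVERSION macro and the scale function are made up for illustration, and the guard deliberately restricts the attribute to GCC on x86-64 glibc Linux, where it relies on ifunc support; on anything else this sketch just compiles a single version. As far as I understand it, the choice is then made by an ifunc resolver that the dynamic loader runs, rather than by a hand-written trampoline.

```c
#include <stdlib.h>  /* size_t; on glibc this also defines __GLIBC__ */

/* Assumption: only enable the attribute where ifunc support is expected. */
#if defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__) && \
    defined(__GNUC__) && !defined(__clang__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

/* The compiler emits one clone per listed target plus a resolver;
 * the source contains a single implementation. */
MULTIVERSION
void scale(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) x[i] *= k;
}
```

Callers just call scale(x, n, k) as usual. Whether the AVX2 clone is actually vectorized still depends on the optimization level and on the autovectorizer.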
Thanks for the pointer. I read the video transcript and agree with their premise that indirect calls are slow.
There are several ways to proceed from there. One could simply inline FastMemcpy into a larger block of code, and basically hoist the dispatch up until its overhead is low enough.
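A rough sketch of what hoisting the dispatch means, with hypothetical names (square, square_all) and assuming GCC or Clang on x86-64 for the AVX2 path: the tiny operation is inlined into whole-array kernels, and the indirect call happens once per array instead of once per element.

```c
#include <stddef.h>

/* The small operation we want inlined, not called through a pointer. */
static inline float square(float v) { return v * v; }

/* Whole-array kernels: the loop lives inside the dispatched function. */
static void square_all_scalar(float *x, size_t n) {
    for (size_t i = 0; i < n; ++i) x[i] = square(x[i]);
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_AVX2_PATH 1
/* Same loop; the compiler may use AVX2 inside this function only. */
__attribute__((target("avx2")))
static void square_all_avx2(float *x, size_t n) {
    for (size_t i = 0; i < n; ++i) x[i] = square(x[i]);
}
#endif

typedef void (*square_all_fn)(float *, size_t);

/* Pick a kernel based on what the CPU supports. */
static square_all_fn pick_square_all(void) {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return square_all_avx2;
#endif
    return square_all_scalar;
}

/* One indirect call per array, so its cost is spread over n elements. */
void square_all(float *x, size_t n) {
    pick_square_all()(x, n);
}
```

The larger the block of work behind the single indirect call, the less the dispatch overhead matters; the selected pointer could also be cached rather than re-selected on every call.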
Instead, what they end up doing is pessimizing memcpy so that it is not inlined and even goes through another thunk call, and deferring the cost of patching until your code is paged in (which could be in a performance- or latency-sensitive area). Indeed, their microbenchmark does not prove a real-world benefit, i.e. that the thunks and patching actually cost less than the dispatch overhead they avoid. It falls into the usual trap of repeating something 100K times, which implies perfect prediction, something that would not be the case in normal runs.
Also, the detection logic is limited to rules known to the OS: certainly sufficient for detecting AVX-512, but probably harder for something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And it is certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".
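As an illustration of the "just measure and pick the best" option, here is a minimal sketch with made-up candidates (sum_simple, sum_unrolled) and a hypothetical pick_fastest helper. It times each candidate with clock() on caller-supplied data and returns the fastest. Note that a repeated-call timing loop like this is itself a microbenchmark, so it shares the prediction caveat mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*sum_fn)(const uint8_t *data, size_t n);

/* Candidate 1: straightforward loop. */
static uint64_t sum_simple(const uint8_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

/* Candidate 2: four independent accumulators. */
static uint64_t sum_unrolled(const uint8_t *data, size_t n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    for (; i < n; ++i) a += data[i];  /* leftover elements */
    return a + b + c + d;
}

/* CPU time in seconds for `reps` calls of one candidate. */
static double time_candidate(sum_fn f, const uint8_t *data, size_t n, int reps) {
    volatile uint64_t sink = 0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) sink += f(data, n);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return the candidate with the lowest measured time (ties keep the earlier one). */
sum_fn pick_fastest(const sum_fn *candidates, size_t count,
                    const uint8_t *data, size_t n) {
    sum_fn best = candidates[0];
    double best_time = time_candidate(best, data, n, 100);
    for (size_t i = 1; i < count; ++i) {
        double t = time_candidate(candidates[i], data, n, 100);
        if (t < best_time) {
            best_time = t;
            best = candidates[i];
        }
    }
    return best;
}
```

The result could be stored in the same kind of function pointer slot that ordinary CPU-feature dispatch uses, so the measurement only runs once at startup.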
So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch.
Caveat emptor.
Cool, I wasn't familiar with libvolk. It seems to be a collection of prewritten kernels, so it only helps if the function you want to write is among them.
github.com/google/highway seems to be closer to what is requested here. It provides around 200 portable intrinsics with which you can write a wide variety of functions (only a single implementation required). Disclosure: I am the main author.
Oooh, that's exactly the problem I had these days. The Firefox extension "Firefox translations" uses a tool (bergamot-translator) which requires SSE4 (or is it SSE3.1?) to run at all, so on all slightly old machines it fails miserably with a weird error. (90% of the machines I see around are 5 to 10 years old, work perfectly and don't need to be replaced, thank you very much: save money, think of the planet, that sort of thing.)
Rust's native support for this is super verbose and repetitive. There's a macro library that can deal with all the unsafe and version picking for you though.
Autovectorization basically doesn't work. The effort to test it properly (that it got vectorized the way you expect or at all) is more maintenance than writing it yourself.
If you insist on abstractions, autoscalarization (the opposite approach) would be better, which is kind of how Fortran works… but I unironically recommend just writing assembly like ffmpeg does.