Are there any libraries that allow me to write different versions of the same fu...

bobowzki · on Nov 28, 2022

Libvolk works exactly like you describe.

https://www.libvolk.org/

curiousmindz · on Nov 28, 2022

In C#, you can write (pseudo-code): if (AVX2) { ... } else if (SSE) { ... } else ...

Then, when you run the program, the JIT will pick the first supported option and eliminate the other ones.

Rafuino · on Nov 29, 2022

Generally speaking the vendor libraries have dynamic dispatch support, which can identify which functions are available on a CPU and then deploy to the best at runtime. Intel got into some hot water for having their dispatch hurt performance on AMD CPUs, but it seems that's been fixed. Their IPP-Crypto and AMD's AOCL-Cryptography libraries both support dynamic dispatch these days, for example.

eesmith · on Nov 28, 2022

https://gcc.gnu.org/wiki/FunctionMultiVersioning ?

And in clang - https://releases.llvm.org/7.1.0/tools/clang/docs/AttributeRe... .

I've never used them.

Don't know about other compilers.
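
For anyone who, like me, hasn't tried it: a minimal sketch of what the GCC feature looks like, assuming GCC on x86-64 Linux with glibc (the attribute relies on ifunc support). The `saxpy` function and the `MULTIVERSIONED` macro are made up for illustration; on any other toolchain the macro expands to nothing and you just get one ordinary function.

```c
#include <stdlib.h>  /* a libc header, included first so __GLIBC__ is visible */
#include <stddef.h>

/* Assumption: only enable the attribute where GCC + glibc ifunc are present. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

/* One source-level definition. Where the attribute is active, GCC emits an
   AVX2 clone and a default clone, plus a resolver that picks between them
   when the program is loaded. */
MULTIVERSIONED
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Callers just call `saxpy(...)` as usual; the selection happens behind the symbol. Whether the AVX2 clone is actually faster depends on what the autovectorizer makes of the loop.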

fsfod · on Nov 28, 2022

MSVC recently added their own version[1] of it, controlled with a [[msvc::dispatch]] attribute that, I think, still needs an experimental compiler flag.

1: C++ Function Multiversioning in Windows https://www.youtube.com/watch?v=LTM1was1dTU

janwas · on Nov 29, 2022

Interesting, did not know that, thanks.

It seems to me that a table of function pointers is all that's required. Highway is a little fancier in that the first entry is a trampoline that first detects CPU capabilities and then calls your pointer; subsequent calls go straight to the appropriate function.
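
A rough sketch of the trampoline idea in plain C, using a single function pointer rather than a full table. This is not Highway's actual code or API; `sum`, `sum_scalar`, `sum_avx2` and `cpu_has_avx2` are hypothetical names, and the "AVX2" version is a stand-in that reuses the scalar loop so the example compiles anywhere. The detection uses the GCC/Clang x86 builtins and falls back to "no" on other compilers or architectures.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Baseline implementation that runs on any CPU. */
static float sum_scalar(const float *v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += v[i];
    return acc;
}

/* Stand-in for a real AVX2 kernel, which would normally live in a file
   compiled with -mavx2. Here it simply reuses the scalar loop. */
static float sum_avx2(const float *v, size_t n) {
    return sum_scalar(v, n);
}

/* Assumption: GCC/Clang builtins on x86; everything else reports 0. */
static int cpu_has_avx2(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

static float sum_trampoline(const float *v, size_t n);

/* The pointer initially targets the trampoline. */
static sum_fn sum_impl = sum_trampoline;

/* First call lands here: detect once, overwrite the pointer, then forward
   the call. Later calls go straight to the chosen implementation. */
static float sum_trampoline(const float *v, size_t n) {
    sum_impl = cpu_has_avx2() ? sum_avx2 : sum_scalar;
    return sum_impl(v, n);
}

/* Public entry point: one indirect call per invocation. */
float sum(const float *v, size_t n) {
    return sum_impl(v, n);
}
```

As written, the pointer update is not synchronized. Concurrent first calls would all store the same value, but production code would presumably want an atomic store or a one-time init.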

Do the (experimental/non-portable) compiler versions contribute any additional value?

eesmith · on Nov 29, 2022

> ... contribute any additional value?

I gather from the linked-to video that binary-load-time selection has better run-time performance than init-at-first-call run-time dispatch, and doesn't have the tradeoff between performance and security.

janwas · on Nov 29, 2022

Thanks for the pointer. I read the video transcript and agree with their premise that indirect calls are slow. There are several ways to proceed from there. One could simply inline FastMemcpy into a larger block of code, and basically hoist the dispatch up until its overhead is low enough.

Instead, what they end up doing is pessimizing memcpy so that it is not inlined, and even goes through another thunk call, and defers the cost of patching until your code is paged in (which could be in a performance or latency-sensitive area). Indeed their microbenchmark does not prove a real-world benefit, i.e. that the thunks and patching are actually less costly than the savings from dispatch. It falls into the usual trap of repeating something 100K times, which implies perfect prediction which would not be the case in normal runs.

Also, the detection logic is limited to rules known to the OS; certainly sufficient for detecting AVX-512, probably harder to do something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".

So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch. Caveat emptor.
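
To illustrate the "hoist the dispatch" alternative mentioned above, here is a minimal sketch in C, assuming GCC or Clang on x86 for the target attribute and detection builtins (elsewhere it degrades to the baseline path). All names (`copy_small`, `copy_records`, and so on) are hypothetical. The small kernel is `static inline`, so the compiler can fold it into each per-target batch loop, and the indirect call is paid once per batch instead of once per record.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define HAVE_X86_DISPATCH 0
#define TARGET_AVX2
#endif

/* Small kernel; inlinable into the batch loops below. */
static inline void copy_small(unsigned char *dst, const unsigned char *src,
                              size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}

/* Batch loop compiled for the default target. */
static void copy_records_baseline(unsigned char *dst, const unsigned char *src,
                                  size_t record_size, size_t count) {
    for (size_t r = 0; r < count; ++r)
        copy_small(dst + r * record_size, src + r * record_size, record_size);
}

/* Same loop, but the compiler is allowed to use AVX2 when generating it. */
TARGET_AVX2
static void copy_records_avx2(unsigned char *dst, const unsigned char *src,
                              size_t record_size, size_t count) {
    for (size_t r = 0; r < count; ++r)
        copy_small(dst + r * record_size, src + r * record_size, record_size);
}

static int cpu_has_avx2(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

typedef void (*batch_fn)(unsigned char *, const unsigned char *, size_t, size_t);

/* Dispatch happens here, once per batch, not once per record. */
void copy_records(unsigned char *dst, const unsigned char *src,
                  size_t record_size, size_t count) {
    static batch_fn chosen = NULL;
    if (chosen == NULL)
        chosen = cpu_has_avx2() ? copy_records_avx2 : copy_records_baseline;
    chosen(dst, src, record_size, count);
}
```

How far up the call tree to hoist is a judgment call: higher means fewer indirect calls but more code duplicated per target.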

janwas · on Nov 28, 2022

Cool, I wasn't familiar with libvolk. It seems to be a collection of prewritten kernels, so it only helps if the function you want to write is among them.

github.com/google/highway seems to be closer to what is requested here. It provides around 200 portable intrinsics with which you can write a wide variety of functions (only a single implementation required). Disclosure: I am the main author.

wazoox · on Nov 28, 2022

Oooh, that's exactly the problem I had these days. The Firefox extension "Firefox translations" uses a tool (bergamot-translator) which requires SSE4 (or is it SSE3.1?) to run at all, so on all slightly old machines it fails miserably with a weird error. (90% of the machines I see around are 5 to 10 years old, work perfectly and don't need to be replaced, thank you very much, save money, think of the planet, this sort of thing).

nynx · on Nov 28, 2022

Rust has built-in support for this, I believe.

3836293648 · on Nov 29, 2022

Rust's native support for this is super verbose and repetitive. There's a macro library that can deal with all the unsafe and version picking for you though.

fmajid · on Nov 28, 2022

http://maskray.me/blog/2021-01-18-gnu-indirect-function

mhh__ · on Nov 28, 2022

GCC can do it for you with autovectorization.

Doing the feature detection is fairly trivial. As long as the unsupported instructions are never executed, they don't cause an error.

astrange · on Nov 28, 2022

Autovectorization basically doesn't work. The effort to test it properly (that it got vectorized the way you expect or at all) is more maintenance than writing it yourself.

If you insist on abstractions, autoscalarization (the opposite approach) would be better, which is kind of how Fortran works… but I unironically recommend just writing assembly like ffmpeg does.