Very interesting. If you're going for size instead of speed, then you can do tha...

near · on May 2, 2014

That will definitely work, and be more portable to esoteric ABIs and calling conventions. But it was more than 3x as slow on the Pentium 4, Athlon 64 and Core 2 Duo E6600. I haven't benchmarked since then. But you're pushing and restoring a whole bunch of volatile registers in vain.

Another fun detail: I tried using xchg r32,m32 to swap the stack pointer out in one instruction. Turns out that on the Pentium 4 (and probably others), the instruction is implemented in microcode now. Plus, with a memory operand it's implicitly locked, so it's a guaranteed atomic operation. The benchmark I wrote ran at least 30x slower than with two mov instructions. I was absolutely blown away by that. People used xchg all the time in the 8086/80286 days to save a bit on code size (a much bigger deal back then). Yet that same code, run today, can end up being substantially slower. Not knowing which opcodes will become slower in the future is a fairly compelling argument against writing inline assembly for speed.
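To make the two variants concrete, here's a rough C sketch of the same swap done both ways. The helper names are made up for illustration, and what the compiler actually emits is an assumption on my part: on x86 the plain version would typically become two movs, while atomic_exchange typically becomes xchg with a memory operand.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap via a plain load followed by a plain store.
   No atomicity is promised, which is fine for a single-threaded switch. */
static inline uintptr_t swap_plain(uintptr_t *slot, uintptr_t incoming) {
    assert(slot != NULL);
    uintptr_t outgoing = *slot;  /* load the old value */
    *slot = incoming;            /* store the new value */
    return outgoing;
}

/* Hypothetical helper: the same swap as one atomic exchange.
   The result is identical, but the operation is guaranteed atomic. */
static inline uintptr_t swap_atomic(_Atomic uintptr_t *slot, uintptr_t incoming) {
    assert(slot != NULL);
    return atomic_exchange(slot, incoming);
}
```

Both return the old value and leave the new one in the slot; the only difference is the atomicity guarantee you pay for in the second one.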

ARM has nice register lists that you can use to mask out the volatile regs. So an optimal implementation is something like:

    push {r4-r11}
    stmia r1, {sp,lr}
    ldmia r0, {sp,lr}
    pop {r4-r11}
    bx lr
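For anyone wondering why the register lists are so cheap: in the A32 encoding of the load/store-multiple instructions, the list is just a 16-bit mask with one bit per register. Here's a small C sketch of that idea. The function names are hypothetical, and it only models the mask, not the full instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with bits first..last set, one bit per register r0..r15. */
static uint16_t reglist_range(unsigned first, unsigned last) {
    assert(first <= last && last <= 15);
    return (uint16_t)(((1u << (last - first + 1)) - 1u) << first);
}

/* Bytes moved by one load/store-multiple: 4 bytes per listed 32-bit register. */
static unsigned reglist_bytes(uint16_t mask) {
    unsigned count = 0;
    for (; mask != 0; mask &= (uint16_t)(mask - 1)) {
        count++;  /* each iteration clears the lowest set bit */
    }
    return count * 4;
}
```

With this, reglist_range(4, 11) gives 0x0FF0 for {r4-r11} (32 bytes), and reglist_range(13, 14) gives 0x6000 for {sp,lr} (8 bytes).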

Moving on to amd64 ... Microsoft ignored the System V ABI (rbx, rbp and r12-r15 are non-volatile) and made rdi, rsi and xmm6-xmm15 non-volatile as well. This makes it more than twice as slow to perform a safe thread switch there. Even their own fibers implementation ignores xmm6-xmm15, unless a special flag is used.
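A quick way to see the difference is to add up the bytes each convention makes you preserve. This is only size accounting in C, not a working context layout: the struct names are hypothetical, the stack pointer is left out of both, and a real save area would want the xmm slots 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Callee-saved state under the System V AMD64 ABI: rbx, rbp, r12-r15. */
struct sysv_saved {
    uint64_t gpr[6];
};

/* Callee-saved state under the Microsoft x64 convention:
   rbx, rbp, rdi, rsi, r12-r15, plus the 128-bit xmm6-xmm15. */
struct win64_saved {
    uint64_t gpr[8];
    uint8_t  xmm[10][16];
};

/* The Microsoft convention preserves more than twice as many bytes. */
static_assert(sizeof(struct win64_saved) > 2 * sizeof(struct sysv_saved),
              "win64 save area should exceed twice the sysv one");
```

That works out to 48 bytes versus 224 bytes, so the xmm registers dominate the cost.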

Probably the strangest was the SPARC CPU. It has register windows for fast register saves/restores on leaf functions. Pretend your 16 regs were a block of memory. It gave you 16 blocks of that memory, and you could change one value to move to a new block of memory. When attempting threading, you couldn't know whether you would recurse deeply enough to exhaust the windows. So you had no choice but to save and restore every single register window. Context switching became immensely demanding. So much so that GCC offered a compilation flag to not use register windows in binaries it produced.
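Here's a toy C model of that "16 blocks of 16 regs" picture, just to show why a call is cheap and a context switch is not. It's a deliberate simplification and every name in it is made up: real SPARC windows overlap between caller and callee, and the number of windows depends on the chip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_WINDOWS = 16, REGS_PER_WINDOW = 16 };

/* Simplified register file: a stack of windows plus an index into it. */
struct window_file {
    uint32_t regs[NUM_WINDOWS][REGS_PER_WINDOW];
    unsigned current;  /* index of the active window */
};

/* A call only advances the index; no registers are copied. */
static void window_call(struct window_file *wf) {
    assert(wf->current + 1 < NUM_WINDOWS);  /* a real CPU would trap and spill here */
    wf->current++;
}

/* A return moves back to the caller's window. */
static void window_return(struct window_file *wf) {
    assert(wf->current > 0);
    wf->current--;
}

/* Words a context switch must save: every window in use, not just the active one. */
static size_t window_switch_words(const struct window_file *wf) {
    return (size_t)(wf->current + 1) * REGS_PER_WINDOW;
}
```

Three calls deep, a switch in this model already has to save 64 words, and the worst case is all 256.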

The choice of volatile vs non-volatile is really fascinating. The fewer non-volatile registers you have, the faster both cooperative and preemptive task switching is. But it also means you have fewer registers that remain safe to use after function calls.

There's also caller vs callee non-volatility: either the caller has to back up the regs it thinks the callee will trample (or all of them); or the callee has to back up the regs it knows it will trample (but may end up backing up regs the caller doesn't actually care about.)
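The trade-off is easy to put numbers on if you treat register sets as bitmasks. This C sketch is only a counting model under that assumption (one bit per register, all names hypothetical); it doesn't describe any particular ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits; each bit stands for one register. */
static unsigned count_regs(uint32_t mask) {
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1) {
        n++;  /* clears the lowest set bit each time */
    }
    return n;
}

/* Caller-saved: the caller can't see inside the callee,
   so it saves every register holding a value it still needs. */
static unsigned caller_save_cost(uint32_t live_in_caller) {
    return count_regs(live_in_caller);
}

/* Callee-saved: the callee can't see the caller,
   so it saves every register it is going to overwrite. */
static unsigned callee_save_cost(uint32_t written_by_callee) {
    return count_regs(written_by_callee);
}

/* Saves that were truly necessary: registers both live and overwritten. */
static unsigned required_saves(uint32_t live, uint32_t written) {
    return count_regs(live & written);
}

/* Saves wasted by a policy: saves performed minus saves needed. */
static unsigned wasted_saves(unsigned performed, unsigned needed) {
    assert(performed >= needed);
    return performed - needed;
}
```

For example, with r0-r3 live in the caller (0x0F) and the callee writing r2-r5 (0x3C), either policy performs 4 saves when only 2 were needed.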