
There are a few issues here. Running threaded code interacts badly with branch prediction. Essentially every "word" (operation) is followed by a branch to the next word, and those seem hard to predict (or at least with 2010-era hardware they were very hard to predict, maybe more modern branch prediction can do better?). The reason is that a popular word, say, +/add, might be called from all over the program. Only one copy of the word exists and the indirect jump to the next word could go anywhere.
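To make the shape of the problem concrete, here's a minimal C sketch of threaded-style dispatch (hypothetical, and much simplified compared to JONESFORTH's actual assembly, where the NEXT jump sits at the end of each primitive): only one copy of each primitive exists, and the indirect call that dispatches to it takes a different target on almost every iteration.

    #include <stdio.h>

    typedef void (*prim_fn)(void);

    static int stack[64];
    static int sp;                       /* data stack pointer */

    static void push(int v) { stack[sp++] = v; }
    static int  pop(void)   { return stack[--sp]; }

    /* Primitives: a single shared copy of each, called from everywhere. */
    static void lit2(void)  { push(2); }
    static void lit3(void)  { push(3); }
    static void add(void)   { int b = pop(); push(pop() + b); }
    static void dot(void)   { printf("%d\n", pop()); }

    int main(void) {
        /* A "colon definition" compiled as a list of word addresses.
           The dispatch below is one indirect branch whose target changes
           on nearly every step: that's the branch the predictor struggles with. */
        prim_fn program[] = { lit2, lit3, add, dot, NULL };

        for (prim_fn *ip = program; *ip != NULL; ) {
            prim_fn w = *ip++;
            w();                         /* indirect jump to the next word */
        }
        return 0;
    }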

However, Forth does allow you to "inline" words, basically forming a new word by chaining together existing words, which avoids this overhead. It doesn't optimise across adjacent words, but it still gets semi-reasonable performance. (Inlining in JONESFORTH: https://github.com/nornagon/jonesforth/blob/d97a25bb0b06fb58...)

Modern Forths just have regular optimising compilers so none of this stuff applies, but they don't have the simple purity of one that you write and fully understand yourself.



Just an anecdotal experience, but I’m currently writing a JVM in Rust with a naive interpreter loop. OpenJDK’s interpreter is a so-called template interpreter: it looks at the method’s bytecode, emits a corresponding machine-code snippet for each instruction, lays them out one after another, and executes that buffer. The advantage of the latter is that in the interpreter-loop model you keep jumping back to the switch statement, which then jumps to the next instruction’s implementation, and so on, while OpenJDK’s solution just continues executing serially, taking advantage of the branch predictor.
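For contrast, here's a minimal sketch (plain C, not my Rust code and not OpenJDK's) of the classic switch-dispatch loop described above: after every instruction you land back on the same dispatch branch, which has to be predicted anew each time, whereas stitched-together templates just fall through into the next instruction's code.

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const uint8_t *code) {
        int64_t stack[64];
        int sp = 0;
        size_t pc = 0;
        for (;;) {
            switch (code[pc++]) {              /* one shared, hard-to-predict dispatch */
            case OP_PUSH:  stack[sp++] = code[pc++];                 break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];         break;
            case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const uint8_t bytecode[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(bytecode);                         /* prints 5 */
        return 0;
    }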

In my benchmarks, which are definitely not suitable for inferring much, OpenJDK without the JIT can perform 1.5-8x better, though of course the whole implementation is different, so don't read too much into that.


It's amazing to me how much of a divergence between expected and actual performance branch-prediction misses cause in modern software.

So many techniques that would improve performance on a 1980s processor can be woefully inefficient on a 21st-century one. It's so easy to still be holding the computing model of the former in your head.


What is absolutely crazy is that a lot of “primitive computation” is basically free. Like, there are more than likely some memory stalls either way, and the arithmetic just happens in those pauses without lengthening execution time.


I learned this the hard way when I was writing an emulator for a retro machine. Say you're blitting sprites to the screen. As you're drawing a line, each pixel either has some portion of a sprite in it, or doesn't. Your instinct is to do it just like the hardware does: 'chase the beam' and check at each pixel, or even just at each line, whether a sprite is present.

Super wrong. You're far better off precomputing the whole sprite bitmap, even if you don't end up using it, and then doing a bulk operation to either display it or not. Doing that "is a sprite here?" check in a loop is super, super expensive, more expensive than just blitting the sprites and not using them.

The most efficient approach ended up being to precompute things into boolean bitset vectors and use SIMD operations driven by those.
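A rough sketch of the idea (hypothetical code, not the parent's emulator): precompute a full-line sprite mask up front, then blend it into the framebuffer with no per-pixel branch. A plain loop like this auto-vectorizes nicely; the same select can also be written with explicit SIMD intrinsics.

    #include <stdint.h>

    #define LINE_W 320

    /* mask[x] is 0xFF where the sprite covers pixel x, 0x00 elsewhere,
       precomputed for the whole line before blitting. */
    void blit_line(uint8_t *dst, const uint8_t *sprite, const uint8_t *mask) {
        for (int x = 0; x < LINE_W; x++) {
            /* branchless select: sprite pixel where the mask is set,
               existing framebuffer pixel otherwise */
            dst[x] = (uint8_t)((sprite[x] & mask[x]) | (dst[x] & ~mask[x]));
        }
    }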

Even if you're not doing GPU stuff, the fastest way to compute these days is to think of everything in bulk, bulk, bulk. The hardware we have now is super efficient at vector and matrix operations, so try to take advantage of it.

(After doing this, I have a hunch that a lot of the classic machine emulators out there, VICE etc., could be made way faster if they were rewritten in this way.)



