
There are a few issues here. Running threaded code interacts badly with branch prediction. Essentially every "word" (operation) is followed by a branch to the next word, and those seem hard to predict (or at least with 2010-era hardware they were very hard to predict, maybe more modern branch prediction can do better?). The reason is that a popular word, say, +/add, might be called from all over the program. Only one copy of the word exists and the indirect jump to the next word could go anywhere.
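To make the shape of the problem concrete, here's a minimal C sketch of threaded-style dispatch (hypothetical, and much simplified compared to JONESFORTH's actual assembly, where the NEXT jump sits at the end of each primitive): only one copy of each primitive exists, and the indirect call that dispatches to it takes a different target on almost every iteration.

    #include <stdio.h>

    typedef void (*prim_fn)(void);

    static int stack[64];
    static int sp;                       /* data stack pointer */

    static void push(int v) { stack[sp++] = v; }
    static int  pop(void)   { return stack[--sp]; }

    /* Primitives: a single shared copy of each, called from everywhere. */
    static void lit2(void)  { push(2); }
    static void lit3(void)  { push(3); }
    static void add(void)   { int b = pop(); push(pop() + b); }
    static void dot(void)   { printf("%d\n", pop()); }

    int main(void) {
        /* A "colon definition" compiled as a list of word addresses.
           The dispatch below is one indirect branch whose target changes
           on nearly every step: that's the branch the predictor struggles with. */
        prim_fn program[] = { lit2, lit3, add, dot, NULL };

        for (prim_fn *ip = program; *ip != NULL; ) {
            prim_fn w = *ip++;
            w();                         /* indirect jump to the next word */
        }
        return 0;
    }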

However, Forth does allow you to "inline" words, basically forming a new word by chaining together existing words, which avoids this overhead. It doesn't optimise across adjacent words, but it still gets semi-reasonable performance. (Inlining in JONESFORTH: https://github.com/nornagon/jonesforth/blob/d97a25bb0b06fb58...)

Modern Forths just have regular optimising compilers so none of this stuff applies, but they don't have the simple purity of one that you write and fully understand yourself.



Just an anecdotal experience, but I’m currently writing a JVM in Rust with a naive interpreter loop. OpenJDK’s interpreter is a so-called template interpreter: it looks at the method’s bytecode, emits a corresponding machine-code snippet for each instruction, lays them out one after another, and executes that buffer. The advantage of the latter is that in the interpreter-loop model you keep jumping back to the switch statement, which then jumps to the next instruction’s implementation, and so on, while OpenJDK’s solution just continues executing serially, taking advantage of the branch predictor.
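For contrast, here's a minimal sketch (plain C, not my Rust code and not OpenJDK's) of the classic switch-dispatch loop described above: after every instruction you land back on the same dispatch branch, which has to be predicted anew each time, whereas stitched-together templates just fall through into the next instruction's code.

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const uint8_t *code) {
        int64_t stack[64];
        int sp = 0;
        size_t pc = 0;
        for (;;) {
            switch (code[pc++]) {              /* one shared, hard-to-predict dispatch */
            case OP_PUSH:  stack[sp++] = code[pc++];                 break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];         break;
            case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const uint8_t bytecode[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(bytecode);                         /* prints 5 */
        return 0;
    }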

In my benchmarks, which are definitely not suitable for inferring much, OpenJDK without the JIT can perform 1.5-8x better, though of course the whole implementation is different, so don't read too much into that.


It's amazing to me how much of a divergence between expected and actual performance branch-prediction misses cause in modern software.

So many techniques that would improve performance on a 1980s processor can be woefully inefficient on a 21st-century one. It's so easy to still be holding the computing model of the former in your head.


What is absolutely crazy is that a lot of “primitive computation” is basically free. Like, there are more than likely some memory stalls either way, and the arithmetic just happens in those pauses without lengthening execution time.


I learned this the hard way when I was writing an emulator for a retro machine. Say you're blitting sprites to the screen. As you're drawing a line, each pixel either has some portion of a sprite in it, or doesn't. Your instinct is to do it just like the hardware does: 'chase the beam' and check at each pixel, or even just at each line, whether a sprite is present.

Super wrong. You're far better off precomputing the whole sprite bitmap, even if you don't end up using it, and then doing a bulk operation to either display it or not. Doing that "is a sprite here?" check in a loop is super, super expensive, more expensive than just blitting the sprites and not using them.

The most efficient approach ended up being to precompute things into boolean bitset vectors and use SIMD operations driven by those.
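A rough sketch of the idea (hypothetical code, not the parent's emulator): precompute a full-line sprite mask up front, then blend it into the framebuffer with no per-pixel branch. A plain loop like this auto-vectorizes nicely; the same select can also be written with explicit SIMD intrinsics.

    #include <stdint.h>

    #define LINE_W 320

    /* mask[x] is 0xFF where the sprite covers pixel x, 0x00 elsewhere,
       precomputed for the whole line before blitting. */
    void blit_line(uint8_t *dst, const uint8_t *sprite, const uint8_t *mask) {
        for (int x = 0; x < LINE_W; x++) {
            /* branchless select: sprite pixel where the mask is set,
               existing framebuffer pixel otherwise */
            dst[x] = (uint8_t)((sprite[x] & mask[x]) | (dst[x] & ~mask[x]));
        }
    }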

Even if you're not doing GPU stuff, the fastest way to compute these days is to think of everything in bulk, bulk, bulk. The hardware we have now is super efficient at vector and matrix operations, so try to take advantage of it.

(After doing this, I have a hunch that a lot of the classic machine emulators out there, VICE etc., could be made way faster if they were rewritten in this way.)



