That's what I thought too when I saw the numbers: OK, he managed to get it 18% faster, but how much bigger was the result? I'm guessing far more than 18% bigger. I've seen cases where replacing a function with a smaller one that microbenchmarked slower actually led to the application as a whole running much faster, because more code fit in the cache. Ditto for individual instructions: unfortunately (though Intel seems to be slowly changing this) many of the shortest instructions aren't the fastest when considered in isolation, so compilers tend to avoid them -- but the 2-3x difference there is far less than the cost of a single cache miss.
64K of L1 might seem like a huge amount for a single function to fit into, but that 64K is shared with every other often-used piece of code. This is also why aggressive loop unrolling is no longer recommended.
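To make the unrolling trade-off concrete, here's a minimal sketch (function names and the 4x factor are my own, purely illustrative). Both functions compute the same sum; the unrolled one executes fewer branch instructions per element, but its loop body is roughly four times larger, and that extra code competes for I-cache space with everything else that's hot:

```c
#include <stddef.h>

/* Straightforward rolled loop: small code footprint. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Manually unrolled 4x: fewer loop-control instructions per element,
   but ~4x the code size for the body -- the classic speed-vs-footprint
   trade the comment above is about. */
long sum_unrolled(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)      /* handle the remainder */
        s += a[i];
    return s;
}
```

In a microbenchmark with a warm cache the unrolled version often wins; in a real application the extra bytes it occupies can evict code you'll need a few microseconds later, which is exactly the effect a microbenchmark can't see.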
The first one is alignment of branches: the hand-written assembly contains no directives to align basic blocks that are branch targets, whereas gcc happily emits these for some basic blocks. I mention this first because it is mere conjecture; I never attempted to measure the effect myself.
And if you did measure it, you would likely not see much of a difference on a modern x86; in fact it might make things slower, since it can confuse the branch predictor: x86 instructions are variable-length, so their addresses are naturally unaligned, which lets the predictor use e.g. the lower-order address bits as cache tags. Align every branch target to the same boundary and those bits stop distinguishing one branch from another.
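For reference, the kind of directive in question looks like this in gcc's assembly output (the exact parameters vary by target and optimization level; this shape is typical of an -O2 x86-64 build, and the label name is just an example):

```
        .p2align 4,,10          # pad with nops so the next label lands
        .p2align 3              # at or near a 16-byte boundary
.L4:                            # top of a hot loop / branch target
```

The second operand of `.p2align` caps how many padding bytes the assembler may insert, so gcc only pays the alignment cost when the target is already close to a boundary.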