> Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Instruction-level parallelism is incredibly important. I'd say that any optimizing programmer needs to fully understand ILP, and how it interacts with pipelines and dependency cutting (and register renaming).

Modern CPUs extract a lot of parallelism through ILP. Any good, modern hash function will take advantage of that.
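
To make the dependency-cutting point concrete, here is a rough C sketch. It is not taken from any real hash function: the function names, the `MIX_PRIME` constant, the lane seeds, and the final combine step are all made up for illustration. The first function has one long dependency chain, so every multiply waits on the previous one. The second splits the work across four independent accumulators, which an out-of-order core can overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary odd 64-bit constant, chosen for illustration only. */
#define MIX_PRIME 0x9E3779B97F4A7C15ULL

static inline uint64_t rotl64(uint64_t x, unsigned r)
{
    /* r must be in 1..63 here; all call sites below satisfy that. */
    return (x << r) | (x >> (64 - r));
}

/* One accumulator: each multiply depends on the result of the previous
 * one, so throughput is bounded by multiply latency. */
uint64_t hash_one_chain(const uint64_t *words, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (acc ^ words[i]) * MIX_PRIME;
    return acc;
}

/* Four accumulators: the four multiplies inside one iteration do not
 * depend on each other, so the CPU is free to run them in parallel. */
uint64_t hash_four_chains(const uint64_t *words, size_t n)
{
    uint64_t a0 = 0, a1 = 1, a2 = 2, a3 = 3; /* hypothetical lane seeds */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 = (a0 ^ words[i + 0]) * MIX_PRIME;
        a1 = (a1 ^ words[i + 1]) * MIX_PRIME;
        a2 = (a2 ^ words[i + 2]) * MIX_PRIME;
        a3 = (a3 ^ words[i + 3]) * MIX_PRIME;
    }

    /* Leftover words (fewer than four) are folded into lane 0. */
    for (; i < n; i++)
        a0 = (a0 ^ words[i]) * MIX_PRIME;

    /* Illustrative combine step: rotate each lane differently, then XOR. */
    return rotl64(a0, 1) ^ rotl64(a1, 7) ^ rotl64(a2, 12) ^ rotl64(a3, 18);
}
```

The two functions compute different hashes, so this only shows the shape of the trick. Whether the four-lane version is actually faster depends on the core and the compiler, so measure before trusting it.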

Case in point: xxhash appears to be scalar 32-bit / 64-bit code. No vectorization as far as I can tell; it's relying purely on ILP for its speed.

x86-64 has a scalar 64-bit multiply, whereas the SIMD integer multiplies (at least through AVX2) only take 32-bit or narrower inputs. I've theorized to myself that the 64-bit multiply could lead to better mixing than the vectorized instructions, and it seems like xxhash goes for that.
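
Here is a small C sketch of what I mean by "better mixing". The constants and function names are my own, picked for illustration, and the second function only emulates in scalar code what a per-lane 32-bit vector multiply would do to one 64-bit slot. With the full 64-bit multiply, a change in the low input bits carries up into the high half of the result. With two independent 32-bit lanes, the low half can never influence the high half.

```c
#include <stdint.h>

/* Arbitrary odd constants, chosen for illustration only. */
#define K64 0x9E3779B97F4A7C15ULL
#define K32 0x9E3779B1U

/* Scalar 64x64 -> low 64 multiply: carries propagate across all 64 bits. */
uint64_t mul64_scalar(uint64_t x)
{
    return x * K64;
}

/* Scalar emulation of two independent 32-bit lane multiplies packed in
 * one 64-bit value: each half is multiplied on its own and truncated,
 * so nothing crosses the lane boundary. */
uint64_t mul32_lanes(uint64_t x)
{
    uint32_t lo = (uint32_t)x * K32;
    uint32_t hi = (uint32_t)(x >> 32) * K32;
    return ((uint64_t)hi << 32) | lo;
}
```

Flip bit 0 of the input and compare the upper 32 bits of each result: `mul32_lanes` leaves them untouched, while `mul64_scalar` changes them. A vectorized hash can compensate with extra shuffles or rotates between lanes, but that costs instructions.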

The 32-bit version of xxhash can likely be vectorized and optimized even further.