I agree. HyperTransport was stunning when it came out. Revolutionary, even. Ditto for AMD64, still the standard as we speak.
I'm not sure, however, if this is due to Intel resting on their laurels vs an entire Intel generation being shown the door because of epic (excuse the pun) fuckups. With the (largely wasted) effort expended on Itanium / Itanium 2 / EPIC / IXP / NetBurst / etc, no wonder other vendors excelled. The MHz wars took a horrible toll on Intel for mainstream x86 with things like Prescott, with its 31-stage (!!) pipeline.
Stalls on Prescott were horrific for performance. On IXP, microcode screwups (often due to explicit caching) were horrific for customers. On Itanium, everything was horrific. I doubt we will ever know exactly how much these escapades cost humanity. On the other hand, maybe we're all richer for lessons learned.
It seems Intel is not interested in screwing up so badly anymore, so I think it's the competitors' turn to sweat. Intel still has a long way to go in recapturing territory it could have already had; ARM and MIPS have come a long way in the phone/server and NPU/packet processing space respectively, and they don't look as easily dislodged as AMD...
It's even worse than you enumerate: echoing lsc and his mention of "stunningly inefficient and expensive rambus ram", the highest level architects at Intel were petrified by concerns about (I think it was) DRAM size, and ordered some stunningly stupid things that, at least in some cases, the engineers under them knew wouldn't work. Intel had not one, but two "1 million part recalls", one of which was for motherboards just before OEMs were going to start shipping.
And AMD, which only occasionally manages this, did everything right for a short period of time with their K8 microarchitecture (P6 style, 64 bits, HyperTransport plus on-chip memory controller) while Intel was screwing up so much.
I wonder how history would have gone if they hadn't then taken 2.5 years to start delivering the successor K10 microarchitecture, and another half a year to deliver one that didn't have a screwed up TLB. Intel is not the sort of adversary you can just give three years to get its act together, especially with their historical manufacturing prowess keeping them at least a process node ahead of you (and pretty much everyone else?).
Intel is closing the gap with their DPDK, but Cavium creams them clock-for-clock and on cost of goods.
Cavium Octeon chips currently scale to 32 cores at 1.4 GHz, with ZIP, GZIP, AES, SHA1, etc. coprocessors running at 800 MHz. All cores share a fast, coherent unified L2.
One of the key advantages of the Octeon architecture is their hardware work scheduling unit. This is essentially a highly programmable hash engine on packet fields (with software-only bits for software classify-then-reschedule). The idea is to ensure that no two packets with identical hashes are ever in flight at the same time, on any core.
If programmed correctly, this work scheduling prevents data structure contention, which is particularly problematic when you scale to 32 cores (and next-gen up to 48, then I believe to 64).
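To make the scheduling idea concrete, here is a minimal software sketch of it in Python. It is only a toy model of the behaviour described above, not Cavium's hardware or SDK API: the `TagScheduler` class, its method names, the choice of CRC32 as the hash, and the packet field names are all assumptions made up for illustration.

```python
import zlib
from collections import deque


class TagScheduler:
    """Toy model: never hand out two packets with the same tag at once."""

    # Hypothetical choice of fields to hash on (a classic 5-tuple).
    FIELDS = ("src", "dst", "sport", "dport", "proto")

    def __init__(self):
        self.pending = deque()   # (tag, packet) pairs waiting for a core
        self.in_flight = set()   # tags currently being processed

    def tag_of(self, packet):
        # Hash the selected packet fields into a 32-bit tag.
        key = "|".join(str(packet[f]) for f in self.FIELDS)
        return zlib.crc32(key.encode())

    def submit(self, packet):
        self.pending.append((self.tag_of(packet), packet))

    def get_work(self):
        # Return the oldest pending packet whose tag is not in flight.
        for i, (tag, packet) in enumerate(self.pending):
            if tag not in self.in_flight:
                del self.pending[i]
                self.in_flight.add(tag)
                return tag, packet
        return None  # everything pending is blocked behind an in-flight tag

    def done(self, tag):
        # A core finished this packet; its flow may be scheduled again.
        self.in_flight.discard(tag)
```

Since two packets from the same flow can never be handed out at the same time, per-flow state can be touched without a lock, and per-flow ordering is kept because the older packet is always picked first. Unrelated flows still spread across all the cores.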
The chips also support direct packet transport (XAUI, SGMII, etc), rather than requiring transport across PCIe. Each of these ports can be programmed separately, so you can use switch-specific goofy encapsulation modes (Broadcom HiGig2, Marvell DSA, etc) to support very quick traffic <-> physical port mappings.
I should also mention that Cavium scales down very well, all the way to configurations like 2 cores at 400 MHz for PoS, SOHO usage, and such. So it can be an attractive architecture to target.
Finally, the Octeon family's MIPS64 cores have a lot of extensions to the base ISA, like branch on bit, posted atomic operations (e.g. statistics, where you don't care about the value, you just want to += 42 it), pop count, fast bitfield subfield extract, etc.
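For anyone who hasn't run into these, here is roughly what a few of those operations compute, written out in plain Python. These are software equivalents with made-up helper names (`popcount`, `extract_bits`, `bit_is_set`), not the actual Octeon mnemonics; the point of the extensions is that each of these is a single instruction rather than a shift/mask sequence.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def extract_bits(x, lsb, width):
    """Unsigned bitfield of `width` bits starting at bit `lsb`."""
    return (x >> lsb) & ((1 << width) - 1)


def bit_is_set(x, n):
    """The test a branch-on-bit instruction makes before branching."""
    return (x >> n) & 1 == 1


# First 32-bit word of a typical IPv4 header: version 4, IHL 5.
word = 0x45000054
version = extract_bits(word, 28, 4)
ihl = extract_bits(word, 24, 4)
```

Packet parsing is full of this kind of thing (pull a 4-bit field, test a flag bit, count bits in a mask), which is why it pays off in this space.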