This is where I see the economy of AI going: * Inference becomes cheap - specia...

kcb · 2026-05-06T22:17:24

There's no magic bullet for inference on cheap accelerators. Any accelerator will still require large amounts of high-bandwidth memory.
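A rough back-of-envelope sketch of why bandwidth dominates, assuming batch-1 decoding where every generated token re-reads all of the weights once. The 70B-parameter model, 8-bit weights and ~3.35 TB/s bandwidth are illustrative assumptions, not figures from this thread, and the function names are hypothetical:

```python
# Back-of-envelope: batch-1 decoding is usually memory-bandwidth bound,
# because every generated token must stream (roughly) all weights once.

def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Total bytes needed to store the model weights."""
    return params * bytes_per_param


def max_tokens_per_second(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on single-sequence decode speed if weights are re-read per token.

    Ignores KV-cache traffic, compute limits and batching, so real numbers are lower.
    """
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Illustrative assumptions, not measurements of any specific product:
    params = 70e9          # a hypothetical 70B-parameter model
    bytes_per_param = 1.0  # 8-bit weights
    bandwidth = 3.35e12    # ~3.35 TB/s, roughly top-end HBM3 GPU territory

    model = weight_bytes(params, bytes_per_param)
    print(f"weights: {model / 1e9:.0f} GB")
    print(f"decode ceiling: {max_tokens_per_second(bandwidth, model):.1f} tokens/s")
```

Batching lets many sequences share one pass over the weights, which is why aggregate throughput can be far higher, but the per-sequence ceiling still scales with memory bandwidth.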

exabrial · 2026-05-07T03:03:27

The way to do it _today_ requires enormous amounts of HBM! However, we've never designed inference accelerators, which is actually quite a "trivial" problem; we've just never had the need.

Groq (acqui-hired by Nvidia) came up with a different processor architecture: metric shit-tons of SRAM attached to a modest, single-core, deterministic processor. No HBM needed on this card, and 32x faster inference than today's best GPUs!

These LPUs are pretty useless for training though, which is useful for companies training models! Training is expensive, inference is cheap (someday, not now).

There's also a Canadian company that _literally burned the model as a silicon mask_ on a chip. It's unbelievably (1000x) fast, but not flexible of course: https://chatjimmy.ai

kcb · 2026-05-07T03:13:16

The point is that metric shit-tons of SRAM are still a large amount of expensive memory.
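To put a number on "still a large amount", here is a minimal sketch of how many SRAM-only chips it takes just to hold a model's weights. It assumes an 8-bit 70B-parameter model and about 230 MB of on-chip SRAM per chip (in the ballpark of what Groq has published for its first-generation chip); both numbers and the `chips_needed` helper are illustrative assumptions:

```python
import math


def chips_needed(model_bytes: float, sram_bytes_per_chip: float) -> int:
    """Minimum chips needed to hold all weights in on-chip SRAM (no HBM/DRAM).

    Ignores activations, KV cache and any duplication, so this is a floor.
    """
    return math.ceil(model_bytes / sram_bytes_per_chip)


if __name__ == "__main__":
    # Assumptions for illustration only:
    model_bytes = 70e9     # hypothetical 70B params at 8 bits each
    sram_per_chip = 230e6  # assumed ~230 MB of SRAM per chip
    n = chips_needed(model_bytes, sram_per_chip)
    print(f"{n} chips, {n * sram_per_chip / 1e9:.1f} GB of SRAM in total")
```

Under these assumptions the answer is a few hundred chips for a single copy of the weights, which is the trade: enormous bandwidth per chip, but capacity has to be bought one small SRAM pool at a time.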

exabrial · 2026-05-08T14:32:11

SRAM and HBM are two completely different things though... SRAM is what your L1, L2, and L3 caches are made of (most of the time; asterisks exist). This is something we've been doing for years, and it's a proven technology that's unbelievably cheap. It's all part of the processor.

HBM, by contrast, comes as its own separate chips and dies.

kcb · 2026-05-10T19:19:27

SRAM is still high-bandwidth memory... it's literally the least space-efficient and most expensive type of memory.

CWwdcdk7h · 2026-05-07T11:14:27

Strictly speaking, there is that one startup that compiles entire models into a huge ASIC, with the trade-off that the entire hardware becomes outdated when a new model version is released in 2-3 months.