Hacker News

Given what we just saw from the DeepSeek team squeezing a lot of extra performance out of a more efficient GPU implementation, and given that the model is still optimized for GPU rather than CPU: is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that better optimization for these particular CPUs could squeeze out?


The answer to your question is yes. There is an open issue with llama.cpp about this very thing:

https://github.com/ggerganov/llama.cpp/issues/11333

The TL;DR is that llama.cpp’s NUMA support is suboptimal, which hurts performance relative to what this machine should deliver. A single-socket version would likely perform better until it is fixed. After it is fixed, a dual-socket machine would likely run at the same speed as a single-socket machine.

If someone implemented a GEMV that scales with NUMA nodes (i.e., something like PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual-socket machine than from a single-socket machine.
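The partitioning idea behind a NUMA-scaling GEMV can be sketched as follows. This is only a conceptual illustration: the shard count, the numpy arrays, and the thread pool are stand-ins, and a real implementation would pin each worker thread and its weight shard to a specific NUMA node (e.g. via libnuma or numactl) so each node reads only locally resident memory.

```python
# Conceptual sketch of a NUMA-aware GEMV: shard the weight matrix's rows
# across nodes so each node streams only its local shard, then concatenate
# the partial results. numpy + threads here only illustrate the partitioning;
# actual NUMA locality requires pinning memory and threads per node.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_gemv(weight_shards, x):
    """Compute y = W @ x where W's rows are sharded across NUMA nodes."""
    with ThreadPoolExecutor(max_workers=len(weight_shards)) as pool:
        partials = list(pool.map(lambda shard: shard @ x, weight_shards))
    return np.concatenate(partials)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
x = rng.standard_normal(4)
shards = np.split(W, 2)  # one shard per (hypothetical) NUMA node
y = numa_gemv(shards, x)
assert np.allclose(y, W @ x)
```

Since each token's GEMV reads every weight exactly once, keeping each shard's reads node-local is what lets aggregate bandwidth scale with the number of sockets.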


No, because the bottleneck is RAM bandwidth. The weights are already quantized, and what remains is essentially random, so they can't be compressed in any meaningful way.


How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.


For non-MoE models, every token requires streaming the entire model through the CPU. So if it is a 32B-parameter model quantised to 8 bits/parameter, that is 32GB of RAM traffic per token. If your RAM does 64GB/s, that is 2 tok/s.
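The arithmetic above can be written as a one-line back-of-envelope calculation (the function name and parameters are just illustrative):

```python
# Back-of-envelope tokens/sec for a dense model: every generated token must
# stream all weights from RAM, so tok/s = bandwidth / model size in bytes.
def tokens_per_second(params_billions, bits_per_param, bandwidth_gb_s):
    gb_per_token = params_billions * bits_per_param / 8  # GB read per token
    return bandwidth_gb_s / gb_per_token

# 32B parameters at 8 bits/parameter on 64 GB/s RAM:
print(tokens_per_second(32, 8, 64))  # → 2.0
```

This ignores KV-cache reads and compute time, which is why it is an upper bound rather than a prediction.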


I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of is the attention mechanism being used: both GQA and MQA shrink the KV cache, so they demand less compute and memory bandwidth than MHA.


How big are the active weights? That's how much bandwidth you need per generated token.
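For a MoE model, the same back-of-envelope estimate applies with the active (not total) parameter count. As an illustration, assuming DeepSeek-V3-style figures of roughly 37B active parameters per token (the exact number and quantization here are assumptions for the example):

```python
# For MoE models, only the parameters activated for a given token must be
# streamed from RAM, so substitute the active parameter count.
def tokens_per_second(active_params_billions, bits_per_param, bandwidth_gb_s):
    gb_per_token = active_params_billions * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token

# ~37B active parameters at 8 bits/parameter on 64 GB/s RAM:
print(round(tokens_per_second(37, 8, 64), 2))  # → 1.73
```

This is why a large MoE model can decode at speeds comparable to a much smaller dense model, even though all of its weights must still fit in memory.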


Maybe a little, but FLOPs and memory bandwidth don't lie.




