Hacker News

Given what we just saw from the DeepSeek team squeezing a lot of extra performance out of a more efficient GPU implementation, and given that the model is still optimized for GPU rather than CPU: is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that better optimization for these particular CPUs could squeeze out?


The answer to your question is yes. There is an open issue with llama.cpp about this very thing:

https://github.com/ggerganov/llama.cpp/issues/11333

The TL;DR is that llama.cpp’s NUMA support is suboptimal, which hurts performance relative to what this machine should deliver. A single-socket version would likely perform better until it is fixed. After it is fixed, a dual-socket machine would likely run at the same speed as a single-socket machine.

If someone implemented a GEMV that scales with NUMA nodes (i.e., something like PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual-socket machine than from a single-socket machine.
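The partitioning idea behind a NUMA-scaling GEMV can be sketched as follows. This is only a conceptual illustration: the shard count, the numpy arrays, and the thread pool are stand-ins, and a real implementation would pin each worker thread and its weight shard to a specific NUMA node (e.g. via libnuma or numactl) so each node reads only locally resident memory.

```python
# Conceptual sketch of a NUMA-aware GEMV: shard the weight matrix's rows
# across nodes so each node streams only its local shard, then concatenate
# the partial results. numpy + threads here only illustrate the partitioning;
# actual NUMA locality requires pinning memory and threads per node.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_gemv(weight_shards, x):
    """Compute y = W @ x where W's rows are sharded across NUMA nodes."""
    with ThreadPoolExecutor(max_workers=len(weight_shards)) as pool:
        partials = list(pool.map(lambda shard: shard @ x, weight_shards))
    return np.concatenate(partials)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
x = rng.standard_normal(4)
shards = np.split(W, 2)  # one shard per (hypothetical) NUMA node
y = numa_gemv(shards, x)
assert np.allclose(y, W @ x)
```

Since each token's GEMV reads every weight exactly once, keeping each shard's reads node-local is what lets aggregate bandwidth scale with the number of sockets.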


No, because the bottleneck is RAM bandwidth. The weights are already quantized, and what remains is essentially random, so they can't be compressed in any meaningful way.


How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.


For non-MoE models, every token requires streaming the entire model through the CPU. So if it is a 32B-parameter model quantised to 8 bits/parameter, that is 32GB of RAM traffic per token. If your RAM does 64GB/s, that is 2 tok/s.
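The arithmetic above can be written as a one-line back-of-envelope calculation (the function name and parameters are just illustrative):

```python
# Back-of-envelope tokens/sec for a dense model: every generated token must
# stream all weights from RAM, so tok/s = bandwidth / model size in bytes.
def tokens_per_second(params_billions, bits_per_param, bandwidth_gb_s):
    gb_per_token = params_billions * bits_per_param / 8  # GB read per token
    return bandwidth_gb_s / gb_per_token

# 32B parameters at 8 bits/parameter on 64 GB/s RAM:
print(tokens_per_second(32, 8, 64))  # → 2.0
```

This ignores KV-cache reads and compute time, which is why it is an upper bound rather than a prediction.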


I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of is the attention mechanism being used: both GQA and MQA shrink the KV cache, so they demand less compute and memory bandwidth than MHA.


How big are the active weights? That's how much bandwidth you need per generated token.
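For a MoE model, the same back-of-envelope estimate applies with the active (not total) parameter count. As an illustration, assuming DeepSeek-V3-style figures of roughly 37B active parameters per token (the exact number and quantization here are assumptions for the example):

```python
# For MoE models, only the parameters activated for a given token must be
# streamed from RAM, so substitute the active parameter count.
def tokens_per_second(active_params_billions, bits_per_param, bandwidth_gb_s):
    gb_per_token = active_params_billions * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token

# ~37B active parameters at 8 bits/parameter on 64 GB/s RAM:
print(round(tokens_per_second(37, 8, 64), 2))  # → 1.73
```

This is why a large MoE model can decode at speeds comparable to a much smaller dense model, even though all of its weights must still fit in memory.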


Maybe a little, but FLOPs and memory bandwidth don't lie.




