[Not a specialist, just a keen armchair fan of this sort of work]
> In addition to being a graph model, BDH admits a GPU-friendly formulation.
I remember, about two years ago, people spotting that if you just pushed the weights through a sigmoid and reduced the floats down to -1, 0 or 1, a lot of LLMs barely lost any performance, but you suddenly opened up the ability to run them on multi-core CPUs, which are obviously a lot cheaper and more power efficient. And yet nothing much seems to have moved forward there.
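For anyone who hasn't seen it, the core trick is roughly this (a minimal NumPy sketch in the spirit of the BitNet b1.58 "absmean" ternary scheme; the single per-matrix scale and the function names are my own simplification, not the exact published recipe):

```python
import numpy as np

def ternarize(W: np.ndarray, eps: float = 1e-8):
    """Quantize a float weight matrix to {-1, 0, +1} plus a per-matrix scale.

    Rough sketch of absmean ternary quantization; real recipes differ in
    details like per-group scaling and how rounding interacts with training.
    """
    scale = np.abs(W).mean() + eps                       # one scalar scale per matrix
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

def ternary_matmul(x: np.ndarray, W_ternary: np.ndarray, scale: float):
    """y = x @ (scale * W_ternary). With {-1, 0, +1} weights the inner loop
    is just additions/subtractions, which is what makes CPUs attractive."""
    return (x @ W_ternary) * scale

# Tiny demo: the quantized layer stays close to the float one.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 256))
x = rng.normal(size=(1, 256))
Wq, s = ternarize(W)
print(np.abs(x @ W - ternary_matmul(x, Wq, s)).mean())
```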
I'd love to see new approaches that explicitly don't "admit a GPU-friendly formulation", but still move the SOTA forward. Has anyone seen anything even getting close, anywhere?
> It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data.
That is disappointing. It needs to do better, in some dimension, to get investment, and I do think alternative approaches are needed now.
From the paper though there are some encouraging side benefits to this approach:
> [...] a desirable form of locality: important data is located just next to the sites at which it is being processed. This minimizes communication, and eliminates the most painful of all bottlenecks for reasoning models during inference: memory-to-core bandwidth.
> Faster model iteration. During training and inference alike, BDH-GPU provides insight into parameter and state spaces of the model which allows for easy and direct evaluation of model health and performance [...]
> Direct explainability of model state. Elements of state of BDH-GPU are directly localized at neuron pairs, allowing for a micro-interpretation of the hidden state of the model. [...]
> New opportunities for ‘model surgery’. The BDH-GPU architecture is, in principle, amenable to direct composability of model weights in a way resemblant of composability of programs [...]
These, to my pretty "lay" eyes, look like attractive features to have. The question I have is whether the existing transformer-based approach is now "too big to fail" in the eyes of the people who make the investment calls, and whether this will get the work needed to take it from GPT2 performance to GPT5+.
> I'd love to see new approaches that explicitly don't "admit a GPU-friendly formulation", but still move the SOTA forward. Has anyone seen anything even getting close, anywhere?
The speedup from using a GPU over a CPU is around 100x, as a rule of thumb. There's also been an immense amount of work on maximizing throughput when training on a pile of GPUs together... and a SOTA model will still take a long time to train. So even if you do have a non-GPU algorithm which is better, it'll take you a very, very long time to train it - by which point the best GPU algorithms will have also improved substantially.
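Just to make that concrete, taking the 100x figure at face value (purely illustrative arithmetic; the 30-day run length is a made-up round number, not a measurement):

```python
# Illustrative only: how a 100x slowdown compounds with already-long training runs.
gpu_training_days = 30          # assume a large run takes about a month on GPUs (made-up figure)
slowdown = 100                  # the rule-of-thumb GPU-over-CPU factor from above
cpu_training_years = gpu_training_days * slowdown / 365
print(f"~{cpu_training_years:.1f} years on CPUs")   # ~8.2 years
```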
Wow, that number requires STRONG caveats, lest it be called out as completely false.
Take away the tensor cores (unless all you do is matmuls?), and an H100 has roughly 2x as many f32 flops as a Zen5 CPU, which is considerably cheaper. I suspect brute-force HW/algorithms are not going to age well: https://www.sigarch.org/dont-put-all-your-tensors-in-one-bas...
(/personal opinion)
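For anyone who wants to sanity-check that, here's my back-of-envelope version of the comparison (the core count, clock and peak figures are approximations I've assumed for illustration, not benchmarks):

```python
# Rough peak-FP32 comparison behind the "roughly 2x" claim above.
# All figures are approximate/assumed; real sustained throughput will differ.

h100_fp32_tflops = 67.0   # H100 CUDA-core FP32 peak, tensor cores excluded (approx.)

# A big server Zen 5 part (e.g. a 128-core EPYC with full 512-bit AVX-512):
cores = 128
flops_per_cycle_per_core = 2 * 16 * 2   # 2 FMA pipes x 16 fp32 lanes x (mul + add)
clock_ghz = 3.0                         # assumed sustained all-core clock under AVX-512 load
zen5_fp32_tflops = cores * flops_per_cycle_per_core * clock_ghz / 1000

print(f"Zen 5 peak ~{zen5_fp32_tflops:.0f} TFLOPS fp32")
print(f"H100 peak  ~{h100_fp32_tflops:.0f} TFLOPS fp32 (no tensor cores)")
print(f"ratio      ~{h100_fp32_tflops / zen5_fp32_tflops:.1f}x")  # lands around 2-3x
```

Switch the tensor cores back on at low precision and the gap opens right back up, which is exactly the "unless you only do matmuls" caveat.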