so in essence is it trading memory for speed?

HarHarVeryFunny · 2026-05-06T16:13:25 1778084005

Seems more like trading FLOPs for speed.

If you are just generating as usual with the main model then you're sequentially generating A -> AB -> ABC.

If I'm understanding correctly, speculative decoding first uses a separate small/fast (but less accurate) model to generate what you hope is the ABC sequence (that's the extra FLOPs), then uses the main model to verify it in parallel (A + AB + ABC all at once) rather than generating it sequentially. Assuming you have the FLOPs available to really do this in parallel, it's this parallel verification vs. sequential generation that gives you the speedup.
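
To make the draft-then-verify idea concrete, here's a rough toy sketch of the greedy variant, not how any real inference stack implements it. Everything in it is made up for illustration: `target_model` and `draft_model` are hypothetical stand-ins that map a token prefix to the next token, and the draft is rigged to disagree whenever the prefix length is a multiple of 4. The verification step is written as a plain loop here; in a real system those k+1 checks would be a single batched forward pass of the main model, which is where the speedup is supposed to come from.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large, slow, accurate model.
    return (sum(prefix) + len(prefix)) % 7


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small, fast model: it agrees with the
    # target except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def generate_sequential(model: Model, prompt: List[int], n: int) -> List[int]:
    # Ordinary decoding: one model call per new token (A -> AB -> ABC).
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def generate_speculative(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 3
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n
    target_passes = 0  # number of verification rounds by the main model
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens sequentially (cheap calls).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The main model scores every proposed position. This loop stands
        #    in for ONE batched forward pass over all k+1 prefixes.
        target_passes += 1
        checks = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest prefix of the proposal the main model agrees with.
        accepted = 0
        while accepted < k and proposal[accepted] == checks[accepted]:
            accepted += 1
        tokens.extend(proposal[:accepted])

        # 4. The main model's own token at the first mismatch (or a bonus
        #    token if everything matched) is always correct, so append it.
        tokens.append(checks[accepted])
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = generate_sequential(target_model, prompt, 12)
    spec, passes = generate_speculative(target_model, draft_model, prompt, 12)
    print("same output:", spec == baseline)
    print("main-model rounds:", passes, "vs sequential steps:", 12)
```

The output is identical to decoding with the main model alone; the only thing that changes is how many sequential main-model rounds you need, at the cost of the extra draft calls plus any verified positions that get thrown away after a mismatch.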