
So, in essence, is it trading memory for speed?


Seems more like trading FLOPs for speed.

If you are just generating as usual with the main model then you're sequentially generating A -> AB -> ABC.

If I'm understanding correctly, speculative decoding first uses a different small/fast (but less accurate) model to draft the ABC sequence (you hope), which is where the extra FLOPs go. It then uses the main model to verify that draft in parallel (A + AB + ABC at once) rather than generate it sequentially. Assuming you have the FLOPs available to really do this in parallel, then this parallel verification vs. sequential generation is what gives you the speedup.
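
To make the draft-then-verify idea concrete, here is a rough sketch in Python, assuming plain greedy decoding (real implementations that sample typically use a rejection-sampling acceptance rule instead). Everything in it is hypothetical and only for illustration: `speculative_step`, the toy `target` and `draft` functions, and the idea that a "model" is just a function from a token list to the next token. The list comprehension in step 2 runs sequentially here; on real hardware those checks would be batched into one forward pass of the main model, which is where the speedup would come from.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
Model = Callable[[List[str]], str]

def speculative_step(prefix: List[str], draft: Model, target: Model, k: int) -> List[str]:
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1. The small draft model proposes k tokens, one after another.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The main model scores every draft position. These k + 1 calls do not
    #    depend on each other, so a real system would run them as one batch.
    checks = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest run of draft tokens the main model agrees with,
    #    then append the main model's own token at the first disagreement
    #    (or one extra token if the whole draft was accepted).
    accepted: List[str] = []
    for i in range(k):
        if proposed[i] != checks[i]:
            break
        accepted.append(proposed[i])
    accepted.append(checks[len(accepted)])
    return accepted

# Toy models (assumptions for the demo): the target spells out TEXT,
# and the draft agrees with it except at position 2.
TEXT = "ABCDEF"

def target(seq: List[str]) -> str:
    return TEXT[len(seq) % len(TEXT)]

def draft(seq: List[str]) -> str:
    return "X" if len(seq) == 2 else TEXT[len(seq) % len(TEXT)]

if __name__ == "__main__":
    print(speculative_step([], draft, target, k=3))   # draft is wrong at position 2
    print(speculative_step([], target, target, k=3))  # draft always agrees
```

In this greedy variant the output matches what the main model would have produced on its own; a bad draft costs wasted FLOPs, not accuracy.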



