
Check chatjimmy.ai


https://chatjimmy.ai is a demo of the "burn the model to an ASIC" approach being sold by Taalas[0], which they use to run Llama 3.1 8B at ~17000 tokens per second.

[0] - https://taalas.com/products/


Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model. It's really outdated at this point. It's cool that they were able to accelerate a model with silicon, but it also feels wasteful, since Llama 3.1 8B is such a useless model.


I guess their point was to demonstrate that it's possible to bake a decently sized model into silicon? As with anything hardware-related, the lead time will be considerably longer than for software, so in a 1-2 year timeframe we might see something like Gemma 4 baked into silicon.


Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.

Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.


I think this is the future. When models start converging at "really good" (which I think is already happening) then burning them into ASIC silicon is the natural next step.

Harnesses can keep improving with a fixed model and the throughput opens up new possibilities like doing 10x more "thinking" or exploring parallel paths and picking the best.
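To make the "explore parallel paths and pick the best" idea concrete, here is a minimal best-of-N sketch. Everything in it is an assumption for illustration: `toy_generate` and `toy_score` are hypothetical stand-ins for a real model call and a real verifier, not anything Taalas or chatjimmy.ai exposes.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[float, str]:
    """Sample n candidates for one prompt and return the (score, text) of the best."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, text), text) for text in candidates]
    return max(scored, key=lambda pair: pair[0])


def toy_generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a model call: one seeded pseudo-random "answer".
    rng = random.Random(seed)
    return f"{prompt} -> answer {rng.randint(0, 999)}"


def toy_score(prompt: str, candidate: str) -> float:
    # Hypothetical stand-in for a verifier: higher trailing number wins.
    return float(candidate.rsplit(" ", 1)[1])


best_score, best_text = best_of_n("2+2?", toy_generate, toy_score, n=8)
print(best_score, best_text)
```

With a fixed model, the harness only gets better by raising `n` or improving the scorer, which is where very high token throughput would help.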


I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.

Heck, I'm still a fan of Gemma 2 9B.


Is it still a useless model if, say, you can run it at (prompt+output)*24/s and use it to make executive function decisions?



