
I'm a little mystified by people talking about Qwen 3.6 27b / Gemma 31b being slow in one breath and then saying they're using a 16GB GPU in the next.

You do need to use suitable hardware.
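
For a rough sense of why, here is a back-of-the-envelope sketch of the weight footprint. The function names, the flat 8 bits per weight, and the 2 GiB allowance for runtime overhead are my own illustrative assumptions, not measurements; real quantization formats carry some extra metadata, and the KV cache grows with context length.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a hypothetical allowance for KV cache and runtime
    buffers; the real figure depends on context length and backend.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    size = weight_gib(27, 8)
    print(f"27B at 8 bits/weight: about {size:.1f} GiB of weights")
    print("fits in 16 GiB:", fits(27, 8, 16))
```

Under these assumptions the weights alone exceed 16 GiB, so part of the model has to live outside the GPU, and that is typically where the slowness comes from.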

I get 50 tok/s from Qwen 3.6 27b with Q8 and MTP, and 100 tok/s from 35B-A3B at Q8 (no MTP, as it's not that useful with MoE), on a single workstation GPU that I spent 3k on a couple of years ago. I can get more aggregate tok/s from the 27b by running requests in parallel than by using the MoE, but I don't have enough memory for too many full-sized contexts.

These speeds are somewhat faster than what I've seen from commercial SOTA models, and they're plenty fast for many applications.
