I'm a little mystified at people talking about qwen 3.6 27b / gemma 31b being slow in one breath and then saying they're using a 16GB GPU in the next.
You do need to use suitable hardware.
I get 50 tok/s from Qwen 3.6 27b with Q8 & MTP (I can get more aggregate tok/s by running in parallel than by using the MoE, but I don't have enough memory for too many full-sized contexts) and 100 tok/s with 35B-A3b Q8 (no MTP, as it's not that useful with MoE) on a single workstation GPU that I spent 3k on a couple of years ago.
These speeds are somewhat faster than what I've seen from commercial SOTA models, and they're plenty fast for many applications.
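
For anyone wondering why a 16GB card struggles here, a rough back-of-envelope sketch in Python. It assumes exactly 8 bits per weight for Q8 (real quant formats add some overhead) and counts only the weights, not the KV cache or activations; `weight_gib`, `fits`, and the 2 GiB `overhead_gib` default are hypothetical names and numbers picked for illustration, not anything from a particular runtime.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a guessed allowance for context and buffers;
    the real figure depends on context length and the runtime.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    need = weight_gib(27, 8)  # dense 27B model at an assumed 8 bits per weight
    print(f"27B @ 8-bit weights: ~{need:.1f} GiB")
    print("fits in 16 GiB:", fits(27, 8, 16))
```

Under those assumptions the weights alone come to roughly 25 GiB, so on a 16GB card part of the model has to live somewhere slower than VRAM, which is where the speed goes.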