

julianlam 18 days ago | on: Accelerating Gemma 4: faster inference with multi-...

Really excited to try this once it is merged into llama.cpp. Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing. I have tried draft models with limited success: both the smaller 3B draft model and a dense 14B Ministral model already introduced too much overhead.
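To make the draft-model overhead point concrete, here is a rough sketch of the standard expected-speedup estimate for speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, that the draft proposes `k` tokens per step, and that one draft forward pass costs `c` times one target forward pass. All three names and the example values are illustrative assumptions, not measurements from this thread.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Estimate speculative-decoding speedup over plain decoding.

    alpha: assumed per-token acceptance probability (0..1)
    k:     number of tokens the draft model proposes per step
    c:     assumed cost of one draft pass relative to one target pass
    """
    if alpha >= 1.0:
        # Every drafted token is accepted, plus one token from the target.
        tokens_per_step = float(k + 1)
    else:
        # Expected tokens produced per verification step (geometric series).
        tokens_per_step = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Each step costs k draft passes plus one target verification pass.
    cost_per_step = k * c + 1.0
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Hypothetical numbers: a cheap draft vs. a draft as costly as the target.
    print(f"cheap draft (c=0.05): {expected_speedup(0.8, 4, 0.05):.2f}x")
    print(f"heavy draft (c=1.00): {expected_speedup(0.8, 4, 1.00):.2f}x")
```

Under these assumptions a draft whose per-token cost approaches that of the target drops below 1x, i.e. it is slower than not drafting at all. That is plausibly what happens when the target is a sparse MoE with few active parameters and the draft is a large dense model.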

VHRanger 18 days ago

On vLLM with a 5090 I get 120-180 TPS with the AWQ 4-bit quant + MTP speculative decoding.

For Gemma 4 26B, same quantization, I get >200 TPS.

Also note that Qwen is extremely inefficient in reasoning; its reasoning chains are ~3x longer than Gemma's on average.
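A back-of-the-envelope sketch of what those numbers mean for time-to-answer. It assumes the 120-180 TPS figure refers to Qwen, takes 200 TPS as a lower bound for Gemma, and uses a made-up baseline of 1000 reasoning tokens for Gemma; only the TPS values and the ~3x length ratio come from the comment above.

```python
def seconds_to_answer(tokens: int, tps: float) -> float:
    """Wall-clock generation time for `tokens` at `tps` tokens per second."""
    return tokens / tps


# Hypothetical baseline: Gemma needs 1000 reasoning tokens for some prompt.
GEMMA_TOKENS = 1000
# From the comment: Qwen reasoning chains are ~3x longer on average.
QWEN_TOKENS = 3 * GEMMA_TOKENS

gemma_s = seconds_to_answer(GEMMA_TOKENS, 200.0)     # ">200 TPS", lower bound
qwen_slow_s = seconds_to_answer(QWEN_TOKENS, 120.0)  # assumed Qwen, low end
qwen_fast_s = seconds_to_answer(QWEN_TOKENS, 180.0)  # assumed Qwen, high end

if __name__ == "__main__":
    print(f"Gemma: {gemma_s:.1f} s")
    print(f"Qwen:  {qwen_fast_s:.1f} s to {qwen_slow_s:.1f} s")
    print(f"Ratio: {qwen_fast_s / gemma_s:.1f}x to {qwen_slow_s / gemma_s:.1f}x")
```

So if those assumptions hold, the raw TPS gap understates the difference: the longer reasoning chains multiply it when you measure time per finished answer.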
