I'm not getting anywhere near the speeds advertised on my 3090 Ti, alas, but it'...

osanseviero · 2026-06-11T07:48:26 1781164106

Hi! What implementation are you using? Right now vLLM is the recommended one. llama.cpp support is still an early draft.

petercooper · 2026-06-11T10:24:39 1781173479

Yeah, the patched llama.cpp. The reason is that I saw using the Q4 quant on vLLM is discouraged, and the int8 won't fit on my 3090 Ti, but I could certainly give it a go. I also skipped Transformers, as it needs to download the full weights and quantize them locally, and I didn't fancy waiting for a 50 GB download.