
With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLaMa (https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i...), I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (and I totally use it), but for now it's still not as fast as running the models on the GPU.
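In case it helps, this is roughly the timing loop I use to get that number. It's a minimal sketch assuming a Hugging Face-style model/tokenizer pair is already loaded (GPTQ-for-LLaMa has its own loading scripts, so the names here are just placeholders):

    import time
    import torch

    def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
        # Tokenize the prompt and move it onto the model's device (the 3090 here).
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

        # Time only the generation step, with the GPU queue flushed on both sides.
        torch.cuda.synchronize()
        start = time.perf_counter()
        output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        new_tokens = output.shape[-1] - input_ids.shape[-1]
        return new_tokens / elapsed

    # print(tokens_per_second(model, tokenizer, "The quick brown fox"))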


> IMO GGML is great (and I totally use it), but for now it's still not as fast as running the models on the GPU.

I think it was originally designed to be easily embeddable and, most importantly, native code (i.e. not Python), rather than to compete with GPU inference.

I think it's just starting to get into GPU support now, but carefully.


Have you tried the most recent CUDA offload? A dev claims they're getting 26.2 ms/token (about 38 tokens per second) on a 13B model with a 4080.
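For anyone who wants to try it via the Python bindings (llama-cpp-python), my understanding is that the offload is a single constructor argument; treat the model path and layer count below as placeholders, and note this assumes a cuBLAS-enabled build:

    from llama_cpp import Llama

    # Offload transformer layers to the GPU; 40 covers all of a 13B model,
    # lower it if you run out of VRAM. Requires llama.cpp built with cuBLAS.
    llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_gpu_layers=40)

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])

    # Sanity check on the claimed numbers: 1000 / 26.2 ms/token is roughly 38 tokens/second.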



