
There's a real danger that people will just run LLMs on their local GPUs. There's no guarantee that an online service is a profitable idea.


Going to need some insanely beefy GPU cluster at your house to run anything close to GPT-4.

The best I can run is either LLaMA 65B on my CPU at 1 token per second (too slow) or LLaMA 30B on my GPU at a fast ~30 tokens per second.

Nowhere near the usefulness of GPT-4.
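
For context, running locally on consumer hardware usually means a 4-bit quantized model through something like llama.cpp. A minimal sketch using the llama-cpp-python bindings (the model path, context size, and layer-offload count below are placeholder assumptions, not figures from this thread):

    # pip install llama-cpp-python (build with GPU support if you want offloading)
    from llama_cpp import Llama

    # Hypothetical path to a 4-bit quantized 30B model file.
    llm = Llama(
        model_path="./models/llama-30b.q4_0.gguf",
        n_ctx=2048,        # context window size
        n_gpu_layers=60,   # layers offloaded to the GPU; 0 means CPU-only inference
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])

The n_gpu_layers knob is where the CPU-vs-GPU gap above comes from: offloading every layer onto the card is what gets you into the tens of tokens per second, while pure CPU inference on a 65B model crawls along at around one.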


GPUs will get more powerful in the future, especially if they are designed with LLMs in mind.


I wonder where a critical threshold might lie? For example, right now if you only have 24 GB of VRAM, the best you can run locally at usable speeds are the 4-bit quantized LLaMA 30B models, which are only semi-usable. If you have 48 GB, however, you can run the 4-bit quantized LLaMA 65B, which is much better.
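
Those thresholds drop straight out of the arithmetic: at 4 bits per weight a model needs roughly half a byte per parameter, plus headroom for the KV cache and activations. A back-of-envelope sketch (the 20% overhead factor is my own rough assumption, not a measured number):

    # Rough VRAM estimate for a 4-bit quantized model.
    def vram_gb(params_billion, bits_per_weight=4, overhead=1.2):
        weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
        return weights_gb * overhead  # overhead covers KV cache, activations, buffers

    for p in (30, 65):
        print(f"LLaMA {p}B at 4-bit: ~{vram_gb(p):.0f} GB")
    # 30B lands around 18 GB, which fits a 24 GB card;
    # 65B lands around 39 GB, which needs ~48 GB (e.g. two 24 GB cards).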

I assume there will be advances that make the context window practically infinite. That alone ought to make the lower parameter count models much more powerful. I also assume they will become more efficient to run through sparsity/pruning, though I'm a total novice on this topic.

I wonder how many generations of hardware we're talking about? One, two, three? It feels like it's potentially within that range.


You could say the same about many SaaS projects. Why pay for an expensive GPU upfront, and then for someone who can install it, configure it, and build some sort of interface for you to talk to it... when you can just pay OpenAI to do it for less money?


Because GPU servers that can run a typical LLM cost less than $5k, installation included. Really, running an LLM seems to be no more complicated than running a NAS from a system administrator's perspective.

Emphasis on running, not training.
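
For what it's worth, the "running" side really can be a single long-lived process. The llama-cpp-python package, for example, ships a small HTTP server that speaks an OpenAI-style API, so the in-house "interface" can be a few lines of client code. A sketch, assuming such a server is already listening locally (the port, endpoint path, and prompt are assumptions):

    # pip install requests
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",  # assumed local OpenAI-compatible endpoint
        json={
            "prompt": "Summarize our backup policy in one sentence:",
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])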


$5k can't even get you 80 GB of VRAM on a GPU. How is that possible?



