I wonder where a critical threshold might lie? For example, right now if you only have 24 GB of VRAM, the best you can run locally at usable speeds is a 4-bit quantized LLaMA 30B model, which is only semi-usable. If you have 48 GB, however, you can run the 4-bit quantized LLaMA 65B, which is much better.
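For what it's worth, the back-of-envelope math behind those 24 GB / 48 GB thresholds is roughly parameters × 4 bits for the weights, plus some headroom for the KV cache and activations. Here's a minimal sketch of that estimate; the ~20% overhead factor is my own assumption, not a measured figure:

    # Rough VRAM estimate for 4-bit quantized weights.
    # The 1.2x overhead for KV cache / activations is a guess, not measured.
    def approx_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
        weight_bytes = params_billion * 1e9 * bits / 8
        return weight_bytes * overhead / 1e9

    for size in (30, 65):
        print(f"LLaMA {size}B @ 4-bit: ~{approx_vram_gb(size):.0f} GB")
    # LLaMA 30B @ 4-bit: ~18 GB -> fits in 24 GB
    # LLaMA 65B @ 4-bit: ~39 GB -> wants ~48 GB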
I assume there will be advances that make the context window practically infinite. That alone ought to make the lower-parameter-count models much more powerful. I also assume they will become more efficient to run through sparsity/pruning, though I'm a total novice on this topic.
I wonder how many generations of hardware we're talking about? One? Two? Three? It feels like it's potentially within that range.
You could say the same about many SaaS projects. Why pay for an expensive GPU upfront, and then pay some guy to install it, configure it, and build some sort of interface for you to talk to it... when you can just pay OpenAI to do it for less money?
Because GPU servers that can run a typical LLM cost less than $5k, including installation. Really, running an LLM seems to be no more complicated than running a NAS from a system administrator's perspective.