We are talking about 7B models? Those can run on consumer GPUs with lower latency than A100s AFAIK (because gaming GPUs are clocked differently).
Not to mention OpenAI has shit latency and terrible reliability - you should be using Azure models if you care about that - but pricing is also higher.
I would say fixed costs and development time are on OpenAI's side, but I've seen people post great practical comparisons for latency and cost using hosted fine-tuned small models.
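For anyone who wants to run that comparison themselves, here is a rough back-of-the-envelope sketch. It assumes per-token API pricing and a GPU rented by the hour and kept fully busy. The function names and every number in it are made-up placeholders, not real prices or benchmarks, so plug in your own.

```python
from typing import Optional


def hosted_cost_per_million_tokens(gpu_cost_per_hour: float,
                                   tokens_per_second: float) -> float:
    """Dollar cost to generate 1M tokens on a rented GPU.

    Assumes the GPU is fully utilized; idle time makes the real cost higher.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_tokens(fixed_cost: float,
                      api_cost_per_million: float,
                      hosted_cost_per_million: float) -> Optional[float]:
    """Tokens needed before self-hosting repays its fixed cost.

    fixed_cost covers one-off spend such as fine-tuning and engineering time.
    Returns None if the hosted model is not cheaper per token than the API.
    """
    saving_per_million = api_cost_per_million - hosted_cost_per_million
    if saving_per_million <= 0:
        return None
    return fixed_cost / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs only (hypothetical, not quotes from any provider).
    hosted = hosted_cost_per_million_tokens(gpu_cost_per_hour=0.72,
                                            tokens_per_second=100)
    tokens = break_even_tokens(fixed_cost=5000,
                               api_cost_per_million=10.0,
                               hosted_cost_per_million=hosted)
    print(f"hosted: ${hosted:.2f} per 1M tokens")
    print(f"break-even after about {tokens:,.0f} tokens")
```

The point of splitting it this way is that the fixed cost only matters through the break-even volume: at low volume the API wins, at high volume the per-token saving dominates.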
It runs fantastically well on an M2 Mac + llama.cpp. A variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.
It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.
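If you want an actual number instead of "fast enough", a tiny timing harness is enough for a first look. This is only a sketch: `measure_throughput` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in you would replace with a call into whatever local runtime or API you are testing.

```python
import time
from typing import Callable, Dict, List


def measure_throughput(generate: Callable[[str], List[str]],
                       prompt: str,
                       runs: int = 3) -> Dict[str, float]:
    """Time `generate(prompt)` over several runs.

    `generate` must return the list of generated tokens so they can be counted.
    Reports mean wall-clock latency per call and overall tokens per second.
    """
    latencies = []
    token_count = 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_count += len(tokens)
    total = sum(latencies)
    return {
        "mean_latency_s": total / runs,
        "tokens_per_second": token_count / total if total > 0 else float("inf"),
    }


def fake_generate(prompt: str) -> List[str]:
    """Stand-in model (hypothetical): sleeps briefly, then echoes the prompt's words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    stats = measure_throughput(fake_generate, "a b c d", runs=3)
    print(stats)
```

Run it with the same prompt and output length against each backend, otherwise the numbers are not comparable.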