This is the problem: you need the best model, not just a good one, for:
- Good architecture, which requires reading specs, code, etc. In practice that means lots of tokens in and out.
- Bug fixing: same, plus logs (e.g. Datadog).
Once you've found the path, patches are trivial and the savings are tiny unless you're doing refactoring/cleanup.
Testing gets more and more complicated. Take a look at opencode go, and you see this:
> Includes GLM-5.1, GLM-5, Kimi K2.5, Kimi K2.6, MiMo-V2-Pro, MiMo-V2-Omni, MiMo-V2.5-Pro, MiMo-V2.5, Qwen3.5 Plus, Qwen3.6 Plus, MiniMax M2.5, MiniMax M2.7, DeepSeek V4 Pro, and DeepSeek V4 Flash
And now you're on your own with the bugs all of these models can produce at scale. Am I missing anything in this picture? What is the real use of cheaper models?
Any missed bug or wrong architecture decision is a huge loss. Sure, if you run it as autocomplete on steroids, you can use any Chinese model. But if you try to move faster, and that is a conscious choice, any hiccup is a productivity loss and tons of tokens burned.
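A rough sketch of the tradeoff I mean, with completely made-up numbers. The prices per million tokens, the miss rates, and the rework cost are all placeholders for illustration, not measurements of any real model:

```python
def expected_cost(tokens_in, tokens_out, price_in, price_out, miss_rate, rework_cost):
    """Expected dollar cost of one task: token spend plus expected rework.

    price_in / price_out are dollars per million tokens (hypothetical).
    miss_rate is the assumed chance the model gets the task wrong.
    rework_cost is the assumed dollar cost of a missed bug or bad decision.
    """
    token_cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return token_cost + miss_rate * rework_cost


# Same task for both models: lots of context in, a small patch out.
# Every number below is hypothetical.
strong = expected_cost(200_000, 20_000, price_in=10.0, price_out=40.0,
                       miss_rate=0.05, rework_cost=200.0)
cheap = expected_cost(200_000, 20_000, price_in=1.0, price_out=4.0,
                      miss_rate=0.20, rework_cost=200.0)

print(f"strong model: ${strong:.2f}  cheap model: ${cheap:.2f}")
```

With these placeholder inputs the token bill is the small term and the expected rework dominates, which is the point above. Plug in your own rates; if the rework cost is near zero (autocomplete, cleanup), the cheap model wins easily.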