> The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI, and also seem to be hitting a scaling wall
I don't know who needs to hear this, but the real breakthrough in AI that we have had is not LLMs, but generative AI. LLMs are but one specific case. Furthermore, we have hit absolutely no walls. Go download a model from Jan 2024, another from Jan 2025, and one from this year and compare. The improvement from one to the next is exponential.
I've been wondering about this for quite a while now. Why does everybody automatically assume that I'm using the decimal system when saying "orders of magnitude"?!
Because, as xkcd 169 says, communicating badly and then acting smug when you're misunderstood is not cleverness. "Orders of magnitude" refers to base 10 in the vast majority of uses (I must admit I have no concrete data on this, but I can find plenty of references to it being base-10 and only a suggestion that it could be something else).
Unless you've explicitly stated that you mean something else, people have no reason to think that you mean something else.
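For the curious, here's a minimal sketch (plain Python, function name mine) of why the assumed base matters: "n orders of magnitude" just means a factor of base**n, so the same phrase implies wildly different multipliers depending on the base.

```python
# Minimal sketch: the multiplier implied by "n orders of magnitude"
# depends entirely on the base you assume.

def orders_of_magnitude_factor(n: int, base: int = 10) -> int:
    """Return the multiplicative factor implied by n orders of magnitude."""
    return base ** n

# "Three orders of magnitude better":
print(orders_of_magnitude_factor(3))          # 1000 (base 10, the usual reading)
print(orders_of_magnitude_factor(3, base=2))  # 8    (base 2, a far weaker claim)
```

So unless a base is stated, a reader defaulting to base 10 hears a thousand-fold claim where a base-2 speaker only meant eight-fold.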
There is a lot of talking past each other when discussing LLM performance. The average person, whose typical use case is asking ChatGPT how long to boil an egg, hasn't seen improvements for 18 months. Meanwhile, if you're deep into something like local models, the tangible improvements are, without exaggeration, happening almost monthly.
> The average person whose typical use case is asking ChatGPT how long they need to boil an egg for hasn't seen improvements for 18 months
I don’t think that’s true. I think both my mother and my mother-in-law would start to complain pretty quickly if they got pushed back to 4o. Change may have felt gradual, but I think that’s more a function of growing confidence in what they can expect the machine to do.
I also think “ask how long to boil an egg” is missing a lot here. Both use ChatGPT in place of Google for all sorts of shit these days, including plenty of stuff they shouldn’t (like: “will the city be doing garbage collection tomorrow?”). Both are pretty sharp women but neither is remotely technical.
GP was talking about commercially hosted LLMs running in datacenters, not free Chinese models.
Local is definitely still improving. That's another reason the megacenter model (NVDA's big "line goes up forever" plan) is either a financial catastrophe about to happen, or the biggest bailout ever.
5.2 is great if you ask it engineering questions, or questions an engineer might ask. It is extremely mid, and actually worse than the o3/o4-era models, if you start asking it trivia like whether the I-80 tunnel on the Bay Bridge (Yerba Buena Island) is the largest bore in the world. Don't even get me started on whatever model is wired up to the voice chat button.
But yes, it will write you a flawless, physics-accurate flight simulator in Rust on the first try. I've proven that. I guess what I'm trying to say is Anthropic was eating their lunch at coding, and OpenAI rose to the challenge, but if you're not doing engineering tasks their current models are arguably worse than older ones.
In addition to engineering tasks, it's an ad-free answer box. Outside of cross-checking things or browsing search results, it's totally replaced Google/search-engine use for me. I also pay for Kagi for search. In the last year I've been able to fully divorce myself from the Google ecosystem besides Gmail and Maps.
According to OpenAI, it's something like 4.2% of usage. But this data is from before Codex added subscription support, and I think it only covers ChatGPT (back when most people were using ChatGPT for coding work, before agents got good).
The execs I've talked to are paying for it to answer capex questions, as a sounding board for decision making, and, perhaps most importantly, for crafting/modifying emails for tone and content. In the Bay Area particularly, a lot of execs are foreign with English as their second language, and LLMs can cut email-writing time in half.
I'd believe that but I was commenting on who actually pays for it. My guess is that most individuals using AI in their personal lives are using some sort of free tier.
Yeah, agreed. There were some minor gains, but new releases are mostly benchmark-overfit sycophantic bullshit that are only better on paper and horrible to use. The more synthetic data they add, the less world knowledge the model has and the more useless it becomes. But at least they can almost mimic a basic calculator now /s
For API models, OpenAI's releases have regularly not been an improvement for a long while now. Is Sonnet 4.5 better than 3.5 outside the pretentious agentic workflows it's been trained for? Basically impossible to tell; they make the same braindead mistakes sometimes.