
Virtually every announcement of a new model release has some sort of table or graph matching it up against a bunch of other models on various benchmarks, and they're always selected in such a way that the newly-released model dominates along several axes.

It turns interpreting the results into an exercise in detecting which models and benchmarks were omitted.



It would make sense, wouldn't it? Just as we've seen rising fuel efficiency, safety, dependability, etc. over the lifecycle of a particular car model.

The different teams are learning from each other and pushing boundaries; there's virtually no reason for any of them to release a model or product that is somehow inferior to a prior one (unless it has some secondary advantage, such as requiring lower-end hardware).

We're simply not seeing the ones that came up short; models that fall short of current benchmarks aren't worth releasing to the public, so we never see them at all.


Sibling comment made a good point about benchmarks not being a great indicator of real-world quality. Every time something scores near GPT-4 on benchmarks, I try it out and it ends up being less reliable than GPT-3 within a few minutes of usage.


That's totally fine, but benchmarks are like standardized tests such as the SAT. They measure something, and it makes sense that each release bests the prior one on those benchmarks.

It may even be the case that in optimizing against the benchmarks, these product teams sacrifice some real-world performance (just as a student who only studies for the SAT might sacrifice some real-world skills).


That's a valid theory, a priori, but if you actually follow up you'll find that the vast majority of these benchmark results don't end up matching anyone's subjective experience with the models. The churn at the top is not nearly as fast as the press releases make it out to be.


Subjective experience is not a benchmark you can measure success against. Also, of course new models are better on some set of benchmarks. Why would someone bother releasing a "new" model that is inferior to old ones (aside from attributes like a more favorable license)?

This is completely normal; the opposite would be strange.



