Hacker News | XCSme's comments

That's interesting. There's not much we can do to test whether we get the same model...

I also don't trust the maxbenched results.

I am thus making my own benchmarks: https://aibenchy.com


In your benchmark, GPT-5 Nano is basically tied with Opus?

Yes. Opus could do a lot better, but it fails often because it doesn't respect the given formatting instructions/output format.

I could modify the tests to emphasize the requirements, but then what's the point of the test? In real life, we expect the AI to do what we ask, especially in agentic use-cases or in n8n, because if the output is even slightly wrong, the entire workflow fails.
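To make the "slightly wrong output breaks the workflow" point concrete, here is a toy sketch (my illustration, not the benchmark's actual harness) of a pipeline step that parses a model reply mechanically. The `answer` field is a hypothetical requirement for the example:

```python
import json

# Toy sketch of why strict output formats matter in agentic pipelines:
# the next step parses the model's reply mechanically, so any deviation
# (extra prose, markdown fences) aborts the whole run.
def parse_step_output(reply: str) -> dict:
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        # In an n8n-style workflow this would fail the entire run.
        raise ValueError(f"workflow failed: malformed output ({exc})")
    if "answer" not in data:  # hypothetical required field
        raise ValueError("workflow failed: missing 'answer' field")
    return data

print(parse_step_output('{"answer": "B"}'))  # a well-formed reply passes
```

A reply like `Sure! Here is the JSON: {"answer": "B"}` would be "almost right" to a human but still kills the run, which is exactly what the scoring penalizes.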


Also, they're not really tied: Opus has much better consistency and reasoning scores (meaning the reasoning made sense; only the final output was wrong).

I got similar results for most models, with Gemini 3 Flash (with reasoning) being the most consistent/reliable model: https://aibenchy.com

I also noticed the same thing: some models reason correctly but draw the wrong conclusions.

And MiniMax M2.5 just reasons forever (filling the entire reasoning context) and still gives wrong answers. This is also why it's #1 on OpenRouter: it burns through tokens.


Fun read. Math makes so much intuitive sense in his head.

Thanks, just added UXWizz there. First one in its city in Romania :D

How reliable is sendBeacon though? Last time I checked, some browsers had it disabled by default.


Not so bad, it's above MiniMax M2.5 with reasoning in my tests (mostly because MiniMax M2.5 seems to reason forever):

https://aibenchy.com/compare/?left=stepfun-step-3-5-flash-fr...


Funnily enough, in my tests, Gemini 3 Flash with medium reasoning does better. It seems 3.1 Pro reasoned its way to the correct answer, but chose to go with a different (wrong) one: https://aibenchy.com/compare/?left=google-gemini-3-flash-pre...

EDIT: while also being 3x cheaper



Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like random sampling: you stumble upon the right result after running gradient descent from each sampled point.
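The analogy above can be sketched in a few lines (a toy illustration of random restarts plus local descent, my framing, not a claim about how LLMs actually work). The function and constants are made up for the example:

```python
import random

# Multi-start gradient descent: a single descent can get stuck in a
# shallow local minimum, but random sampling of start points will
# eventually drop one into the deeper basin.
def f(x):
    return (x**2 - 1) ** 2 + 0.3 * x  # two basins; global min near x = -1

def grad(x):
    return 4 * x * (x**2 - 1) + 0.3   # derivative of f

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(10)]  # the "random sampling"
best = min((descend(x0) for x0 in starts), key=f)
print(best)  # lands in the deeper basin, near -1
```

A start that happens to land left of the ridge descends into the global minimum, much like a reasoning trace that happens to explore the right angle.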


Are you intentionally keeping the benchmarks private?

Yes.

I am trying to figure out the best way to give as much information as possible about how the AI models fail, without revealing details that could help them overfit to those specific tests.

I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.
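A hypothetical sketch of that plan: a second LLM call that sees only the model's output and an abstract format description, never the hidden test prompt. `call_llm` is a stand-in for any chat-completion API (stubbed here so the sketch runs):

```python
# Hypothetical sketch: summarize why a model failed a hidden test
# without leaking the test itself.
def call_llm(prompt: str) -> str:
    # Stub response; a real implementation would call a hosted model.
    return "The model wrapped its answer in prose instead of bare JSON."

def summarize_failure(model_output: str, expected_format: str) -> str:
    # Only the model's own output and an abstract description of the
    # expected format are sent -- the hidden test prompt never is.
    prompt = (
        f"In one sentence, say why this output violates the format "
        f"'{expected_format}':\n{model_output}"
    )
    return call_llm(prompt)

print(summarize_failure('Sure! {"answer": "B"}', "bare JSON object"))
```

The leaderboard could then show these one-line failure summaries while the test prompts stay private.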


Added one more test, which, surprisingly, Gemini 3 Flash with reasoning passes but Gemini 3.1 Pro does not.
