More

XCSme · 2026-02-24T10:18:41 1771928321

That's interesting. There's not much we can do to test whether we get the same model...

XCSme · 2026-02-24T10:16:42 1771928202

I also don't trust the maxbenched results.

I am thus making my own benchmarks: https://aibenchy.com

andai · 2026-02-24T12:56:32 1771937792

In your benchmark, GPT 5 Nano is basically tied with Opus?

XCSme · 2026-02-24T13:02:08 1771938128

Yes. Opus could do a lot better, but fails a lot because it doesn't respect the given formatting instructions/output format.

I could modify the tests to emphasize the requirements, but then, what's the point of a test. In real life, we expect the AI to do something if we ask it, especially for agentic use-case or in n8n, because if the output is slightly wrong, the entire workflow fails.

XCSme · 2026-02-24T13:03:41 1771938221

Also, not really tied, Opus has a lot better consistency and reasoning score (which means the reasoning made sense, only the final output was wrong).

XCSme · 2026-02-24T10:11:29 1771927889

I got similar results for most models, with gemini 3 flash (with reasoning) being the most consistent/reliable model: https://aibenchy.com

I also noticed the same thing: some models reason correctly but draw the wrong conclusions.

And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. This is why it's #1 on OpenRouter, it burns through tokens.

XCSme · 2026-02-24T09:59:57 1771927197

Fun read. Math makes so much intuitive sense in his head.

XCSme · 2026-02-23T16:58:30 1771865910

Thanks, just added UXWizz there. First one in its city in Romania :D

XCSme · 2026-02-23T16:16:14 1771863374

How reliable is sendBeacon though? Last time I checked, maybe browsers had it disabled by default.

XCSme · 2026-02-23T14:23:43 1771856623

It's benchmaxxed: https://aibenchy.com/model/minimax-minimax-m2-5-medium/

XCSme · 2026-02-20T16:10:12 1771603812

Not so bad, it's above MiniMax M2.5 with reasoning in my tests (mostly because MiniMax m2.5 seems to reason forever):

https://aibenchy.com/compare/?left=stepfun-step-3-5-flash-fr...

XCSme · 2026-02-20T00:00:09 1771545609

Funnily, on my tests, 3 flash with medium reasoning does better. Seems like 3.1 pro reasoned about the correct answer, but chose to go with a different (wrong) one: https://aibenchy.com/compare/?left=google-gemini-3-flash-pre...

EDIT: while also being 3x cheaper

XCSme · 2026-02-19T18:54:16 1771527256

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

XCSme · 2026-02-19T18:58:50 1771527530

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore a lot of the angles of a problem, and they might accidentally have an "a-ha" moment that leads them on the right path. It's a bit like doing random sampling and stumbling upon the right result after doing gradient descent from those points.

thevinter · 2026-02-19T20:42:10 1771533730

Are you intentionally keeping the benchmarks private?

XCSme · 2026-02-19T20:52:51 1771534371

Yes.

I am trying to think what's the best way to give most information about how the AI models fail, without revealing information that can help them overfit on those specific tests.

I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.

XCSme · 2026-02-19T20:55:56 1771534556

Added one more test, which surprisingly gemini flash 3 reasoning passes, but gemini 3.1 pro not