That is why we have SWE Bench Pro; it tests architecture design too, turns out ...

SpicyLemonZest · 2026-05-05T14:38:00 1777991880

That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.

threepts · 2026-05-05T15:31:45 1777995105

You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR: it's very effective, as it directly tests models on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.

SpicyLemonZest · 2026-05-05T15:52:31 1777996351

It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I can see in the paper, they do not evaluate the architectural quality of a completed task, only whether the model is capable of completing it at all.

dawnerd · 2026-05-05T16:23:29 1777998209

1000 dollars of subsidized tokens.