Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

That is why we have SWE bench pro, they test architecture design too, turns out 1000 dollars of tokens outperform 10k dollars of labor in meta design.
 help



That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.

You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR its very effective as it directly tests model on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.


It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I see in the paper, they do not evaluate the architectural quality of a task completion, only whether the model is capable of completing it at all.

1000 dollars of subsidized tokens.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: