
96.4% on LiveCodeBench is impressive, but LiveCodeBench is single-shot. The more interesting test is multi-turn agentic work. Has anyone benchmarked DeepSeek V4 Pro against Opus on SWE-bench Verified or similar, where the cheaper model has to be more decisive about tool use over 30+ turns? Curious whether there's a cliff at higher tool-call depths.
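
If anyone has per-episode logs from agentic runs, one rough way to look for that cliff is to bucket episodes by tool-call count and compare pass rates per bucket. A minimal sketch, assuming each episode can be reduced to a `(tool_calls, resolved)` pair; the function name, the bucket size, and the toy records are all hypothetical and are not real benchmark results:

```python
from collections import defaultdict


def pass_rate_by_depth(episodes, bucket_size=10):
    """Return {bucket_start: pass_rate} for (tool_calls, resolved) pairs.

    Hypothetical helper: episodes are grouped into buckets of
    `bucket_size` tool calls (0-9, 10-19, ...) and the fraction of
    resolved episodes is computed per bucket.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tool_calls, resolved in episodes:
        bucket = (tool_calls // bucket_size) * bucket_size
        totals[bucket] += 1
        passes[bucket] += int(resolved)
    return {b: passes[b] / totals[b] for b in sorted(totals)}


# Made-up placeholder episodes, only to show the input shape.
toy_episodes = [
    (4, True), (8, True),
    (15, True), (18, False),
    (33, False), (37, False),
]

for bucket, rate in pass_rate_by_depth(toy_episodes).items():
    print(f"{bucket}-{bucket + 9} tool calls: {rate:.0%} resolved")
```

A sharp drop between adjacent buckets for one model but not the other would be the kind of cliff I mean, though with small per-bucket counts the noise would be large.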

