
As part of the Holistic Agent Leaderboard (HAL) initiative at Princeton CITP, we evaluated more than 220 agent runs, the equivalent of over 20,000 individual agent rollouts, spanning 9 models and 9 benchmarks, at a total cost of roughly $40,000. The benchmarks are: AssistantBench, CORE-Bench Hard, GAIA, Online Mind2Web, Scicode, ScienceAgentBench, SWE-bench Verified Mini, TAU-bench Airline, and USACO.

Along the way, we "burned" 2.6 billion prompt tokens and learned a great deal. In this article, I'd like to share some of those insights, with a particular focus on the GAIA benchmark.
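To put those totals in perspective, here is a rough back-of-envelope calculation based only on the approximate figures quoted above (about 20,000 rollouts, 2.6 billion prompt tokens, and $40,000 in total spend); it ignores output tokens entirely, so the per-token figure is just a blended upper bound, not a provider price:

```python
# Back-of-envelope on the evaluation scale quoted in the text.
# Output tokens are not broken out, so the per-token cost below is
# only a rough blended upper bound.
PROMPT_TOKENS = 2.6e9      # ~2.6 billion prompt tokens
ROLLOUTS = 20_000          # ~20,000 agent rollouts
TOTAL_COST_USD = 40_000    # ~$40,000 total spend

tokens_per_rollout = PROMPT_TOKENS / ROLLOUTS   # ~130,000 prompt tokens
cost_per_rollout = TOTAL_COST_USD / ROLLOUTS    # ~$2.00
cost_per_million_prompt_tokens = TOTAL_COST_USD / (PROMPT_TOKENS / 1e6)  # ~$15.40

print(f"prompt tokens per rollout: {tokens_per_rollout:,.0f}")
print(f"cost per rollout:          ${cost_per_rollout:,.2f}")
print(f"blended $/1M prompt tokens (upper bound): ${cost_per_million_prompt_tokens:,.2f}")
```

In other words, a single rollout averaged on the order of 130k prompt tokens and about $2 in API spend.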


