
As part of the Holistic Agent Leaderboard (HAL) initiative at Princeton CITP, we evaluated more than 220 agent runs, the equivalent of over 20,000 individual agent rollouts, spanning 9 models and 9 benchmarks, at a total cost of roughly $40,000. The benchmarks are: AssistantBench, CORE-Bench Hard, GAIA, Online Mind2Web, Scicode, ScienceAgentBench, SWE-bench Verified Mini, TAU-bench Airline, and USACO.

Along the way, we "burned" 2.6 billion prompt tokens and learned a great deal. In this article, I'd like to share some of those insights, with a particular focus on the GAIA benchmark.
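To put those totals in perspective, here is a rough back-of-envelope calculation based only on the approximate figures quoted above (about 20,000 rollouts, 2.6 billion prompt tokens, and $40,000 in total spend); it ignores output tokens entirely, so the per-token figure is just a blended upper bound, not a provider price:

```python
# Back-of-envelope on the evaluation scale quoted in the text.
# Output tokens are not broken out, so the per-token cost below is
# only a rough blended upper bound.
PROMPT_TOKENS = 2.6e9      # ~2.6 billion prompt tokens
ROLLOUTS = 20_000          # ~20,000 agent rollouts
TOTAL_COST_USD = 40_000    # ~$40,000 total spend

tokens_per_rollout = PROMPT_TOKENS / ROLLOUTS   # ~130,000 prompt tokens
cost_per_rollout = TOTAL_COST_USD / ROLLOUTS    # ~$2.00
cost_per_million_prompt_tokens = TOTAL_COST_USD / (PROMPT_TOKENS / 1e6)  # ~$15.40

print(f"prompt tokens per rollout: {tokens_per_rollout:,.0f}")
print(f"cost per rollout:          ${cost_per_rollout:,.2f}")
print(f"blended $/1M prompt tokens (upper bound): ${cost_per_million_prompt_tokens:,.2f}")
```

In other words, a single rollout averaged on the order of 130k prompt tokens and about $2 in API spend.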


