operatingthetan 26 days ago | on: Exploiting the most prominent AI agent benchmarks

A more interesting benchmark would probably be one that scores the LLM on finding exploits in the benchmark itself.