Open-World Evaluations for Measuring Frontier AI Capabilities
A new paper on arXiv (2605.20520) advocates for open-world evaluations—long-horizon, real-world tasks assessed through qualitative analysis—to complement traditional benchmarks. The authors argue that benchmarks can both overstate and understate deployed capability due to their focus on precisely specifiable, automatically gradable tasks. As a first instance of their CRUX project, they tasked an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, demonstrating that open-world evaluations can provide early warning of capabilities that may soon become widespread. The paper surveys recent open-world evaluations, identifies their strengths and limitations, and offers recommendations for designing and reporting such evaluations. Published May 22, 2026.
The result suggests that developers should prepare for AI agents that can ship real-world software with minimal human intervention.