AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas, described in an arXiv paper (2605.20530), addresses fragmentation in LLM agent evaluation by proposing four components. The first is a six-state control-decision taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover). The second is a nine-category trajectory-failure taxonomy with two orthogonal labels (primary error source and impact). The third is a taxonomy-aware vs. taxonomy-blind methodology for measuring the effect of prompt supervision. The fourth is a benchmark-coverage audit that maps 15 agent benchmarks against six behavioral axes. The methodology was demonstrated on a fixed set of eight models (four frontier closed models and four open-weight models) using 1,342 generated items. The work builds on the 2024-2025 consensus that a single accuracy metric is insufficient for judging whether an agent is deployable.
AgentAtlas provides a structured framework for evaluating agent reliability beyond task success.
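
To make the six-state control-decision taxonomy concrete, here is a minimal Python sketch that encodes the six states named above as an enum and computes a per-state match rate between expected and predicted decisions. Only the six state names come from the summary. The names `ControlDecision` and `per_state_accuracy`, the scoring rule, and the sample data are hypothetical illustrations, not the paper's actual metric, code, or results.

```python
from collections import Counter
from enum import Enum


class ControlDecision(Enum):
    """The six control decisions listed in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def per_state_accuracy(pairs):
    """Return {expected_state: fraction of items where the prediction matched}.

    `pairs` is an iterable of (expected, predicted) ControlDecision tuples.
    This is an assumed, illustrative scoring rule, not the paper's metric.
    """
    totals = Counter()
    correct = Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected is predicted:
            correct[expected] += 1
    # Only states that appear as an expected label are reported.
    return {state: correct[state] / totals[state] for state in totals}


# Made-up data for illustration only: (expected, predicted) decisions.
example = [
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.ASK, ControlDecision.ACT),   # agent acted when it should have asked
    (ControlDecision.ASK, ControlDecision.ASK),
    (ControlDecision.REFUSE, ControlDecision.REFUSE),
]
scores = per_state_accuracy(example)
for state, score in scores.items():
    print(f"{state.value}: {score:.2f}")
```

Reporting one score per state, rather than a single aggregate, is one simple way to surface the kind of behavior a lone accuracy number can hide, such as an agent that completes tasks but rarely asks for clarification when it should.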