arXiv cs.AI · Saturday · May 23, 2026

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

llm-agents · evaluation · benchmarks · taxonomy

AgentAtlas (arXiv 2605.20530) addresses fragmentation in LLM agent evaluation with four components: a six-state control-decision taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover); a nine-category trajectory-failure taxonomy with orthogonal labels (primary error source and impact); a taxonomy-aware vs. taxonomy-blind methodology for measuring the effect of prompt supervision; and a benchmark-coverage audit that maps 15 agent benchmarks against six behavioral axes. The methodology is demonstrated on a fixed set of eight models (four frontier closed, four open-weight) using 1,342 generated items. The work builds on the 2024-2025 consensus that a single accuracy metric is insufficient for deployable agents.
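To make the control-decision taxonomy concrete, here is a minimal Python sketch that encodes the six states named above and scores an agent per state instead of with one aggregate number. Only the six state names come from the summary; the `per_state_accuracy` helper, the `(expected, predicted)` pair format, and the sample data are hypothetical illustrations, not the paper's scoring method.

```python
from collections import Counter
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states listed in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def per_state_accuracy(pairs):
    """Hypothetical helper: accuracy per expected state.

    `pairs` is an iterable of (expected, predicted) ControlDecision tuples.
    States that never appear as `expected` are omitted from the result.
    """
    totals = Counter()
    correct = Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {state: correct[state] / totals[state] for state in totals}


# Made-up sample: the agent acts once when it should have asked.
sample = [
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.ASK, ControlDecision.ACT),
    (ControlDecision.ASK, ControlDecision.ASK),
    (ControlDecision.REFUSE, ControlDecision.REFUSE),
]
report = per_state_accuracy(sample)
for state, accuracy in report.items():
    print(f"{state.value}: {accuracy:.2f}")
```

Breaking the score out by expected state shows the kind of gap a single accuracy figure would hide: in this made-up sample the agent is always right when it should act, but only half right when it should ask.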

// why it matters

Provides a structured framework for evaluating agent reliability beyond task success.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

