Hugging Face · Thursday · May 28, 2026

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

benchmark · agents · enterprise · IT

IBM Research and Artificial Analysis have introduced ITBench-AA, a benchmark designed to evaluate AI agents on real-world enterprise IT operations tasks. The benchmark covers scenarios such as incident management, compliance checks, and system troubleshooting. Frontier models, including GPT-4 and Claude, scored below 50% accuracy, indicating that current AI agents are not yet reliable for autonomous IT operations. The benchmark is publicly available on Hugging Face, allowing researchers to test and improve their models. This release underscores the gap between AI capabilities and the demands of enterprise IT environments.

// why it matters

Developers building AI agents for IT automation must address significant accuracy gaps before deployment.

Sources

Primary · Hugging Face

▸ Read original at huggingface.co

