Hugging Face · Thursday · May 28, 2026

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

benchmark · agents · enterprise IT

IBM Research and Artificial Analysis have introduced ITBench-AA, a benchmark designed to evaluate AI agents on real-world enterprise IT operations tasks. The benchmark covers scenarios such as incident management, compliance checks, and system troubleshooting. Frontier models, including GPT-4 and Claude, scored below 50% accuracy, indicating that current AI agents are not yet reliable for autonomous IT operations. The benchmark is publicly available on Hugging Face, allowing researchers to test and improve their models. This release underscores the gap between AI capabilities and the demands of enterprise IT environments.

// why it matters

Developers building AI agents for IT automation must address significant accuracy gaps before deployment.

Sources

Primary · Hugging Face
