ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
IBM Research and Artificial Analysis have introduced ITBench-AA, a benchmark that evaluates AI agents on real-world enterprise IT operations tasks such as incident management, compliance checks, and system troubleshooting. Frontier models, including GPT-4 and Claude, scored below 50%, indicating that current AI agents are not yet reliable enough for autonomous IT operations. The benchmark is publicly available on Hugging Face, so researchers can test and improve their own models against it. The release underscores the gap between current AI capabilities and the demands of enterprise IT environments.
// why it matters
Developers building AI agents for IT automation must close significant accuracy gaps before deploying those agents autonomously.