arXiv cs.AI · Wednesday · May 27, 2026

JobBench: Aligning Agent Work With Human Will

agents · benchmark · ai-safety · arxiv

JobBench, introduced in a new arXiv paper, shifts the focus of AI agent evaluation from economic replacement to human empowerment. The benchmark includes 130 agentic tasks across 35 occupations, each packaged as a workspace with heterogeneous reference files that mimic real professional environments. Agents are graded against a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. In evaluations of 36 models, the top performer, Claude Opus 4.7 running under Claude Code, achieved only 45.9% accuracy. This low score underscores how much current AI agents struggle with cluttered information streams and complex workflows. The authors hope JobBench will steer the community toward building agents that enhance human work rather than replace it.
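To make the rubric-based grading concrete, here is a minimal sketch of how binary criteria could roll up into a single percentage. The summary does not say how JobBench aggregates its criteria, so the rule used here (per-task pass fraction, then an unweighted mean across tasks) is an assumption, and the `TaskResult` type, function names, and task IDs are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical container: one task and its binary rubric verdicts."""
    task_id: str
    criteria_passed: list[bool]  # one True/False per rubric criterion


def task_score(result: TaskResult) -> float:
    """Fraction of binary criteria satisfied (assumed aggregation rule)."""
    if not result.criteria_passed:
        raise ValueError(f"task {result.task_id} has no criteria")
    return sum(result.criteria_passed) / len(result.criteria_passed)


def benchmark_score(results: list[TaskResult]) -> float:
    """Unweighted mean of per-task scores, as a percentage (assumption)."""
    return 100.0 * mean(task_score(r) for r in results)


# Illustrative data only; these are not tasks or results from the paper.
results = [
    TaskResult("audit-memo", [True, False, True, True]),
    TaskResult("lesson-plan", [False, False, True, False]),
]
print(f"{benchmark_score(results):.1f}%")  # prints "50.0%"
```

Under this assumed rule, a task with many criteria counts the same as a task with few; pooling all criteria before dividing would weight long rubrics more heavily and could give a different headline number.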

// why it matters

Shifts AI agent evaluation from economic replacement to human-centered delegation.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
