JobBench: Aligning Agent Work With Human Will
JobBench, introduced in a new arXiv paper, shifts the focus of AI agent evaluation from economic replacement to human empowerment. The benchmark includes 130 agentic tasks across 35 occupations, each packaged as a workspace with heterogeneous reference files that mimic real professional environments. Agents are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. In evaluations of 36 models, the top performer, Claude Opus 4.7 under Claude Code, achieved only 45.9% accuracy. This low score underscores how much difficulty current AI agents have with cluttered information streams and complex workflows. The authors hope JobBench will steer the community toward building agents that enhance human work rather than replace it.
Shifts AI agent evaluation from economic replacement to human-centered delegation.
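
To make the rubric-based grading more concrete, here is a minimal Python sketch of how binary criteria could roll up into a percentage score. It is not the paper's scoring code: the summary above does not say how criteria are aggregated, so the sketch assumes each task's score is the fraction of its criteria met and the benchmark score is the unweighted mean over tasks. The names `task_score`, `benchmark_score`, and the toy results are hypothetical, and any dependencies implied by a "chain" of rubrics are not modeled.

```python
from statistics import mean


def task_score(criteria: list[bool]) -> float:
    """Fraction of binary rubric criteria satisfied for one task (assumed rule)."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(criteria) / len(criteria)


def benchmark_score(tasks: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-task scores, as a percentage (assumed rule)."""
    return 100 * mean(task_score(c) for c in tasks.values())


# Hypothetical toy results: two tasks with four and two criteria.
results = {
    "task_a": [True, True, False, True],
    "task_b": [False, True],
}

print(f"{benchmark_score(results):.1f}%")
```

Averaging per task rather than pooling all criteria keeps tasks with long rubrics from dominating the total; whether JobBench makes the same choice is not stated in this summary.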