Design and Report Benchmarks for Knowledge Work
The paper, posted on arXiv (2605.23262v1) on May 25, 2026, addresses the gap between how LLM agents score on benchmarks and how well they actually perform knowledge work. It argues that current evaluations still follow the logic of traditional NLP benchmarks, so their scores do not predict real-world effectiveness. The authors review work studies showing that knowledge work is organized through roles, local materials, and artifacts that must remain usable in downstream workflows. From this they propose a three-step approach to designing and reporting a benchmark: define the work activity each task maps to, specify the tested setting (materials, tools, roles, and constraints), and score the appropriate work product of the system. The framework is relevant to developers building LLM agents for coding, research, and healthcare, because it offers a basis for more meaningful evaluation.
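As a rough illustration of the three steps (work activity, tested setting, scored work product), a benchmark task could be described with a small record that flags any part left unspecified. This is a minimal sketch and not the authors' implementation: the class names `SettingSpec` and `BenchmarkTaskSpec`, the method `missing_parts`, and the bug-triage example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SettingSpec:
    """Tested setting: materials, tools, roles, and constraints (hypothetical schema)."""
    materials: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    roles: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)


@dataclass
class BenchmarkTaskSpec:
    """One benchmark task described by the three steps (hypothetical schema)."""
    work_activity: str          # step 1: the work activity the task maps to
    setting: SettingSpec        # step 2: the tested setting
    scored_work_product: str    # step 3: the work product that is scored

    def missing_parts(self) -> List[str]:
        """Return the names of parts that were left empty, in declaration order."""
        missing = []
        if not self.work_activity.strip():
            missing.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.setting, name):
                missing.append(f"setting.{name}")
        if not self.scored_work_product.strip():
            missing.append("scored_work_product")
        return missing


# Illustrative values only; they are not taken from the paper.
spec = BenchmarkTaskSpec(
    work_activity="triage an incoming bug report",
    setting=SettingSpec(
        materials=["issue tracker export"],
        tools=["code search"],
        roles=["on-call engineer"],
        constraints=[],  # left empty on purpose to show the check
    ),
    scored_work_product="triage note usable by the owning team",
)

print(spec.missing_parts())  # ['setting.constraints']
```

A check like this only confirms that each part was written down; whether the described setting and work product are realistic still has to be judged by people who know the work.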
Developers need benchmarks that show whether LLM agents can perform real-world knowledge work, not just score well on NLP-style tasks.