arXiv cs.AI · Tuesday · May 26, 2026

Stop Comparing LLM Agents Without Disclosing the Harness

agents · llm · evaluation · harness

A position paper on arXiv (2605.23950) argues that, for long-horizon tasks evaluated across frontier-capability models, the agent execution harness is often a stronger determinant of agent performance than the model it wraps. The harness here means the infrastructure layer governing context construction, tool interaction, orchestration, and verification. The authors formalize this as the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols systematically misattribute harness-level gains to model improvements. They support the thesis with a control-theoretic formalization that treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which they use to explain why small harness changes can produce performance shifts exceeding those from model substitution. As evidence, the paper cites published benchmarks, industry deployments, and a controlled variance decomposition, all indicating that harness-induced variance can substantially exceed model-induced variance. It calls for mandatory disclosure of harness details in agent evaluations to enable fair comparison.
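To make the variance comparison concrete, here is a minimal sketch of one way such a decomposition could be computed over a model × harness grid of task success rates. It is not the paper's procedure: the function name `decompose_variance`, the model and harness labels, and every number in `toy_scores` are invented for illustration, and the sketch compares only main effects (variance of per-model means versus variance of per-harness means), ignoring interactions and run-to-run noise.

```python
from statistics import mean, pvariance


def decompose_variance(scores):
    """Compare main-effect variance across models vs. across harnesses.

    scores: dict mapping (model, harness) -> success rate in [0, 1].
    Assumes a full grid: every model has been run under every harness.
    """
    models = sorted({m for m, _ in scores})
    harnesses = sorted({h for _, h in scores})

    # Average over harnesses to get one number per model, and vice versa.
    model_means = [mean(scores[(m, h)] for h in harnesses) for m in models]
    harness_means = [mean(scores[(m, h)] for m in models) for h in harnesses]

    return {
        "model_variance": pvariance(model_means),
        "harness_variance": pvariance(harness_means),
    }


# Hypothetical success rates, NOT taken from the paper.
toy_scores = {
    ("model_a", "harness_1"): 0.30,
    ("model_a", "harness_2"): 0.50,
    ("model_a", "harness_3"): 0.70,
    ("model_b", "harness_1"): 0.34,
    ("model_b", "harness_2"): 0.54,
    ("model_b", "harness_3"): 0.74,
}

result = decompose_variance(toy_scores)
print(result)
```

In this toy grid, swapping the model moves the score by a few points while swapping the harness moves it by tens of points, so the harness term dominates. A comparison that reports only the model name would hide exactly that spread.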

// why it matters

Developers who care about agent performance must scrutinize harness design, not just model choice.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
