It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
A study on arXiv (2605.26731) tested the hypothesis that the harness complexity an LLM agent needs should fall as model capability rises. The researchers ran 432 trials on HEAT-24, a 24-task synthetic benchmark with git-based verification, using six models across four tiers (frontier chat, frontier reasoning, strong open, constrained) under light, balanced, and strict harness conditions. The results show a non-monotone relationship. For Gemini 2.5 Flash (frontier chat), increased harness verbosity reduced VTSR by 29-38 percentage points, a harness-complexity paradox. For Qwen3.5-122B (frontier reasoning with extended thinking), the strict harness yielded the highest VTSR (91.7%) and the lowest latency, the opposite of what the hypothesis predicts. In the constrained tier, a 2B model (Gemma4:e2B) matched strong-open-tier stability, holding 91.7% across all three harnesses. The findings challenge prevailing deployment assumptions and suggest that optimal harness design depends on model reasoning type rather than on capability tier alone.
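The trial count can be sanity-checked from the design described above. The sketch below assumes one run per (model, harness, task) cell, which the summary does not state explicitly; the names `trial_grid`, `MODELS`, `HARNESSES`, and `TASKS` are illustrative and not taken from the paper.

```python
from itertools import product

# Design parameters as reported in the summary.
MODELS = 6                                   # six models across four tiers
HARNESSES = ("light", "balanced", "strict")  # three harness conditions
TASKS = 24                                   # HEAT-24 task count


def trial_grid(n_models, harnesses, n_tasks):
    """Enumerate every (model index, harness, task index) cell.

    Assumption: exactly one run per cell; repetitions are not modeled.
    """
    return list(product(range(n_models), harnesses, range(n_tasks)))


grid = trial_grid(MODELS, HARNESSES, TASKS)
trial_count = len(grid)  # 6 * 3 * 24

# Under the same one-run-per-task assumption, 22 passes out of 24 tasks
# rounds to the 91.7% figure quoted above.
rate_22_of_24 = round(22 / 24 * 100, 1)
```

Under that assumption the grid has 432 cells, matching the reported total, and a 91.7% rate would correspond to 22 of 24 tasks passing in a single model-harness condition.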
Developers must tune agent harness structure per model reasoning type, not just capability tier.
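One way to act on this is to choose the harness per model from measured outcomes rather than from the model's tier. The following is a minimal sketch, not the paper's method: the record format, the function names, and the tie-breaking rule (prefer the lighter harness) are assumptions, and the sample records are made-up placeholders rather than results from the study.

```python
from collections import defaultdict

# Assumed ordering, used only to break ties in favor of the lighter harness.
HARNESS_ORDER = ("light", "balanced", "strict")


def success_rates(records):
    """Map (model, harness) to the fraction of trials that passed."""
    counts = defaultdict(lambda: [0, 0])  # (model, harness) -> [passes, total]
    for model, harness, passed in records:
        cell = counts[(model, harness)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


def best_harness_per_model(records):
    """Pick the harness with the highest success rate for each model."""
    best = {}  # model -> (harness, rate)
    for (model, harness), rate in success_rates(records).items():
        current = best.get(model)
        lighter = current is not None and (
            HARNESS_ORDER.index(harness) < HARNESS_ORDER.index(current[0])
        )
        if current is None or rate > current[1] or (rate == current[1] and lighter):
            best[model] = (harness, rate)
    return {model: harness for model, (harness, _) in best.items()}


# Placeholder records (hypothetical models, not data from the study):
# each entry is (model, harness, passed).
sample_records = [
    ("model-a", "light", True), ("model-a", "light", True),
    ("model-a", "strict", True), ("model-a", "strict", False),
    ("model-b", "light", False), ("model-b", "light", True),
    ("model-b", "strict", True), ("model-b", "strict", True),
]

chosen = best_harness_per_model(sample_records)
```

With these placeholder records, `model-a` is assigned the light harness and `model-b` the strict one, illustrating that the selection is driven by each model's measured behavior and not by a fixed tier-to-harness rule.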