arXiv cs.AI · Wednesday · May 27, 2026

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

llm-agents · harness-design · benchmarking · arxiv

A study on arXiv (2605.26731) tested the hypothesis that LLM agent harness complexity should correlate inversely with model capability. The researchers ran 432 trials on HEAT-24, a 24-task synthetic benchmark with git-based verification, using six models across four tiers (frontier chat, frontier reasoning, strong open, constrained) under three harness conditions: light, balanced, and strict. The results show a non-monotone relationship. For Gemini 2.5 Flash (frontier chat), increasing harness verbosity reduced VTSR by 29 to 38 percentage points, a harness-complexity paradox. For Qwen3.5-122B (frontier reasoning with extended thinking), the strict harness yielded the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. In the constrained tier, a 2B model (Gemma4:e2B) matched strong-open-tier stability, holding 91.7% across all harnesses. The findings challenge prevailing deployment assumptions and suggest that optimal harness design depends on a model's reasoning type rather than on capability tier alone.
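The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes one trial per (model, task, harness) cell, which the summary does not state but which matches the 432-trial total. It also treats VTSR as a plain success percentage, another assumption, since the summary does not define the metric. Under those assumptions, 91.7% is what 22 passes out of 24 tasks would produce. The names `trial_grid` and `success_rate` are hypothetical helpers, not from the paper.

```python
from itertools import product

# Counts taken from the summary above.
N_MODELS = 6
N_TASKS = 24
HARNESSES = ("light", "balanced", "strict")


def trial_grid(n_models=N_MODELS, n_tasks=N_TASKS, harnesses=HARNESSES):
    """Enumerate (model_idx, task_idx, harness) cells.

    Assumption: exactly one trial per cell.
    """
    return list(product(range(n_models), range(n_tasks), harnesses))


def success_rate(passed, total):
    """Share of passed trials as a percentage, rounded to one decimal."""
    return round(100.0 * passed / total, 1)


if __name__ == "__main__":
    print(len(trial_grid()))          # 432
    print(success_rate(22, N_TASKS))  # 91.7
```

With only 24 tasks per cell, each task is worth about 4.2 percentage points, so small differences between harnesses amount to one or two tasks.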

// why it matters

The results suggest developers should tune agent harness structure to each model's reasoning type, not just its capability tier.
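One way to act on this is to run a small pilot per model and pick the harness empirically instead of inferring it from the tier. The sketch below is a minimal illustration, not the paper's procedure: `pick_harness` is a hypothetical helper, the pass/fail outcomes are assumed to come from your own verification step, and breaking ties toward the lighter harness is a design choice made here for simplicity.

```python
from collections import defaultdict

# Harness conditions named in the summary, ordered from lightest to strictest.
HARNESS_ORDER = ["light", "balanced", "strict"]


def pick_harness(outcomes):
    """Pick the harness with the highest pilot pass rate for one model.

    outcomes: iterable of (harness, passed) pairs, where passed is a bool.
    Ties go to the lighter harness (an assumption, not a finding).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for harness, ok in outcomes:
        total[harness] += 1
        passed[harness] += 1 if ok else 0
    if not total:
        raise ValueError("no pilot outcomes supplied")
    return max(
        total,
        key=lambda h: (passed[h] / total[h], -HARNESS_ORDER.index(h)),
    )


if __name__ == "__main__":
    # Made-up pilot outcomes, for illustration only.
    pilot = [
        ("light", True), ("light", False),
        ("balanced", True), ("balanced", False),
        ("strict", True), ("strict", True),
    ]
    print(pick_harness(pilot))  # strict
```

Because the choice is made per model from measured outcomes, the same routine can land on a light harness for one model and a strict one for another, which is the pattern the study reports.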

Sources

Primary · arXiv cs.AI
