arXiv cs.AI · Thursday · May 28, 2026 · FREE

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

llm-agents · benchmark · harness · evaluation

Harness-Bench, introduced in arXiv paper 2605.27922, is a diagnostic benchmark that measures how harness configuration affects LLM agent performance in realistic workflows. Unlike existing benchmarks that abstract away execution or hold the harness fixed, Harness-Bench evaluates configuration-level harness effects across multiple model backends under shared task environments, budgets, and evaluation protocols. The benchmark comprises 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final outcomes. The paper emphasizes that agent performance depends not only on the base model but also on the harness, the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. By preserving each harness's native execution behavior, Harness-Bench lets researchers and developers isolate and study execution-layer variation, a factor often overlooked in existing evaluations. The benchmark is available on arXiv and is intended to facilitate systematic comparison of harness designs.
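To make the per-run records and the harness-by-model comparison concrete, here is a minimal Python sketch. It is not the paper's schema or code: the `RunRecord` fields, the `pass_rates` and `harness_spread` helpers, the harness and model names, and the four toy runs are all hypothetical, chosen only to mirror the four kinds of output the summary lists (final artifacts, execution traces, usage statistics, validator outputs).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One run of one task under one (harness, model) pair (hypothetical schema)."""
    task_id: str
    harness: str
    model: str
    passed: bool                                   # validator output
    tokens_used: int = 0                           # usage statistic
    trace: list = field(default_factory=list)      # execution trace events
    artifacts: dict = field(default_factory=dict)  # final artifacts by name


def pass_rates(runs):
    """Return {(harness, model): fraction of runs whose validator passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in runs:
        key = (r.harness, r.model)
        totals[key] += 1
        passes[key] += int(r.passed)
    return {key: passes[key] / totals[key] for key in totals}


def harness_spread(rates):
    """For each model, the gap between its best and worst harness pass rate."""
    by_model = defaultdict(list)
    for (_harness, model), rate in rates.items():
        by_model[model].append(rate)
    return {model: max(v) - min(v) for model, v in by_model.items()}


# Toy data, invented for illustration only: same model, same tasks, two harnesses.
runs = [
    RunRecord("t1", "harness-a", "model-x", True),
    RunRecord("t2", "harness-a", "model-x", False),
    RunRecord("t1", "harness-b", "model-x", True),
    RunRecord("t2", "harness-b", "model-x", True),
]
rates = pass_rates(runs)
spread = harness_spread(rates)
```

Holding the model and task set fixed while varying only the harness is what lets the spread be read as an execution-layer effect rather than a model effect; a real analysis would also draw on the traces and usage statistics, which this sketch stores but does not use.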

// why it matters

Enables systematic study of how harness configurations affect agent performance.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

