arXiv cs.AI · Tuesday, May 26, 2026

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

benchmark · coding-agents · multi-turn · evaluation

EvoCode-Bench, introduced in a new arXiv preprint, is a benchmark of 26 stateful coding tasks spanning 227 evaluated rounds. Each task preserves the agent's workspace across 5 to 15 rounds, states requirements in terms of observable behavior, and uses cumulative executable tests to check both new and earlier requirements. The study evaluates 13 coding agents on two metrics: MT@4 (multi-turn, with a four-attempt fail-stop) and SR (single-round, starting from a reference-completed prior state). For most agents, SR exceeds MT@4 by 22 to 40 points. The agent with the highest SR (78.9) ranks only third on MT@4 (44.0). Even the strongest agents reach only about 50% on the multi-turn metric, and the aggregate pass rate falls below half of its round-1 level by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive longer but still struggle with persistent execution.
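To make the gap between the two metrics concrete, here is a minimal Python sketch of how a fail-stop multi-turn score and a single-round score could diverge on the same task. The summary above does not give the paper's exact scoring formulas, so everything here is an assumption made for illustration: the function names `mt_at_k` and `single_round`, the choice to score each task as the percentage of rounds passed, and the rule that every round after a failed round counts as failed.

```python
from typing import Sequence


def mt_at_k(attempts_per_round: Sequence[Sequence[bool]], k: int = 4) -> float:
    """Hypothetical multi-turn score with a k-attempt fail-stop.

    attempts_per_round[i] holds the pass/fail outcome of each attempt in
    round i. A round passes if any of its first k attempts passes. Once a
    round fails, the episode stops and all later rounds count as failed
    (assumed rule, not taken from the paper).
    """
    total_rounds = len(attempts_per_round)
    if total_rounds == 0:
        return 0.0
    passed = 0
    for attempts in attempts_per_round:
        if any(attempts[:k]):
            passed += 1
        else:
            break  # fail-stop: the agent never reaches the later rounds
    return 100.0 * passed / total_rounds


def single_round(round_passed: Sequence[bool]) -> float:
    """Hypothetical single-round score.

    Each round starts from a reference-completed prior state, so one
    failure does not affect any other round.
    """
    if not round_passed:
        return 0.0
    return 100.0 * sum(round_passed) / len(round_passed)


# Illustrative data for one 5-round task (invented, not from the paper).
multi_turn_attempts = [
    [True],                        # round 1: passes on the first attempt
    [False, True],                 # round 2: passes on the second attempt
    [False, False, False, False],  # round 3: all four attempts fail
    [True],                        # round 4: never reached under fail-stop
    [True],                        # round 5: never reached under fail-stop
]
single_round_results = [True, True, False, True, True]

mt_score = mt_at_k(multi_turn_attempts, k=4)
sr_score = single_round(single_round_results)
print(f"MT@4 = {mt_score:.1f}, SR = {sr_score:.1f}")
```

In this invented example the agent fails only round 3 in both settings, yet the fail-stop rule costs it rounds 4 and 5 as well. This is one way a single-round score can sit well above a multi-turn score for the same agent.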

// why it matters

The results indicate that current coding agents falter under evolving requirements, a critical gap for real-world development.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
