EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
EvoCode-Bench, introduced in a new arXiv preprint, is a benchmark of 26 stateful coding tasks spanning 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check both new and prior requirements. The study evaluates 13 coding agents using two metrics: MT@4 (multi-turn, four-attempt fail-stop) and SR (single-round from a reference-completed prior state). Results show that for most agents, SR exceeds MT@4 by 22-40 points. The highest SR agent (78.9) ranks only third in MT@4 (44.0). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis reveals tier-dependent behavior: weaker agents fail early, while stronger agents survive longer but still struggle with persistent execution.
The benchmark highlights that current coding agents falter under evolving requirements, a critical gap for real-world development.
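To make the difference between the two metrics concrete, here is a minimal Python sketch of a fail-stop multi-turn score next to a single-round score. It is an illustration under stated assumptions, not the paper's implementation: the summary above gives only the metric names, so the per-round fraction used for scoring, the function names `multi_turn_fail_stop` and `single_round`, and the toy agent are all hypothetical.

```python
from typing import Callable

# Hypothetical callback: (round_index, attempt_index) -> True if the
# cumulative tests for that round pass on this attempt.
AttemptFn = Callable[[int, int], bool]


def multi_turn_fail_stop(num_rounds: int, attempt: AttemptFn, max_attempts: int = 4) -> float:
    """Fraction of rounds passed when the agent carries its own workspace forward.

    Assumption: evaluation halts at the first round where every attempt fails,
    and all later rounds count as failed.
    """
    passed = 0
    for r in range(num_rounds):
        # any() short-circuits, so attempts stop at the first success.
        if any(attempt(r, a) for a in range(max_attempts)):
            passed += 1
        else:
            break  # fail-stop: later rounds are never reached
    return passed / num_rounds


def single_round(num_rounds: int, solves_from_reference: Callable[[int], bool]) -> float:
    """Fraction of rounds passed when each round starts from a reference-completed prior state."""
    return sum(1 for r in range(num_rounds) if solves_from_reference(r)) / num_rounds


# Toy agent (hypothetical): it can solve every round except round 2.
def solvable(r: int) -> bool:
    return r != 2


mt = multi_turn_fail_stop(5, lambda r, a: solvable(r))
sr = single_round(5, solvable)
print(f"multi-turn: {mt:.2f}, single-round: {sr:.2f}")
```

In this toy run a single unsolvable round caps the multi-turn score at the rounds completed before it, while the single-round score still credits the later rounds. That is one plausible mechanism behind a gap between SR and MT@4, though the paper's exact scoring may differ.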