arXiv cs.AI · Wednesday, May 27, 2026

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

agents · benchmark · enterprise · constraint-optimization

Anchor, introduced in arXiv paper 2605.26321, addresses artifact drift in AI agent benchmark generation: a failure mode in which instructions, environments, oracles, and verifiers are created in loosely coupled steps and fall out of sync, yielding unsolvable or inconsistent tasks. The pipeline formalizes domain experts' business workflow specifications into constraint optimization programs. From a single parametric specification, it jointly produces a natural-language instruction, an environment configuration, a solver-certified ground-truth solution, and a state-based verifier. Altering the parameters yields new tasks with controlled difficulty and known optimal solutions. The resulting environments are harness-agnostic, with rewards that depend solely on end-state business correctness. The authors apply Anchor to create ERP-Bench, a benchmark of 300 long-horizon enterprise resource planning tasks. Because every artifact derives from the same specification, tasks are solvable, verifiable, and mutually consistent, which supports reliable evaluation of AI agents in enterprise settings.
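To make the single-specification idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Spec` class, the toy procurement workflow, the supplier figures, and the brute-force `solve` function (a stand-in for a real constraint solver) are all assumptions chosen for illustration. It shows one parametric spec yielding an instruction, an environment, a certified optimum, and a verifier that inspects only the end state.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Spec:
    """Hypothetical parametric spec: order `demand` units at minimum cost."""
    demand: int
    suppliers: dict  # name -> (unit_cost, capacity)


def solve(spec):
    """Exhaustive search; a stand-in for a real constraint solver."""
    names = list(spec.suppliers)
    ranges = [range(spec.suppliers[n][1] + 1) for n in names]
    best = None
    for qty in product(*ranges):
        if sum(qty) != spec.demand:
            continue  # violates the demand constraint
        cost = sum(q * spec.suppliers[n][0] for n, q in zip(names, qty))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, qty)))
    if best is None:
        raise ValueError("spec is infeasible; no task is emitted")
    return best


def generate_task(spec):
    """Derive all four artifacts from the same spec so they cannot drift."""
    optimal_cost, plan = solve(spec)
    offers = ", ".join(
        f"{name} (${cost}/unit, max {cap})"
        for name, (cost, cap) in spec.suppliers.items()
    )
    instruction = (
        f"Order exactly {spec.demand} units at minimum total cost. "
        f"Suppliers: {offers}."
    )
    environment = {"suppliers": dict(spec.suppliers), "orders": {}}

    def verifier(end_state):
        # Reward depends only on the final orders, not on how the agent got there.
        orders = end_state.get("orders", {})
        for name, qty in orders.items():
            if name not in spec.suppliers:
                return False
            if not 0 <= qty <= spec.suppliers[name][1]:
                return False
        total = sum(orders.values())
        cost = sum(q * spec.suppliers[n][0] for n, q in orders.items())
        return total == spec.demand and cost == optimal_cost

    return {
        "instruction": instruction,
        "environment": environment,
        "solution": {"cost": optimal_cost, "plan": plan},
        "verifier": verifier,
    }


task = generate_task(Spec(demand=10, suppliers={"A": (3, 6), "B": (5, 10)}))
print(task["instruction"])
print(task["solution"])
```

In this sketch, changing `demand` or the supplier table produces a new task whose optimum is recomputed by the same solver, and an infeasible spec raises an error instead of emitting an unsolvable task.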

// why it matters

Anchor enables reliable, scalable evaluation of AI agents for enterprise automation.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
