MAVEN: Improving Generalization in Agentic Tool Calling
MAVEN, introduced in a June 2026 arXiv paper, addresses generalization challenges in agentic tool-calling environments. It is a symbolic reasoning scaffold that enables structured decomposition, adaptive tool orchestration, and intermediate verification. The paper evaluates MAVEN on BFCL v3, TauBench, Tau2Bench, and AceBench, and also introduces MAVEN-Bench, a stress test for multi-step mathematical and physical reasoning with adversarial task composition. On MAVEN-Bench, MAVEN raised the accuracy of its base model, GPT-OSS-120b, from 48% to 71% (a gain of 23 percentage points) without additional training, and remained competitive with frontier proprietary baselines while using an open-weight backbone. The paper also highlights a gap between partial reasoning quality and end-to-end task success.
With MAVEN, an open-weight model remains competitive with proprietary baselines on tool-calling tasks.
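
The three ingredients named above (structured decomposition, tool orchestration, and intermediate verification) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names Step, run_plan, and the tools registry are hypothetical, and the plan is written by hand here, whereas an agentic system would presumably generate it with a model. The example decomposes the kinetic energy formula 0.5 * m * v^2 (with m = 2, v = 3) into three tool calls and checks each intermediate result before moving on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here (Step, run_plan, tools) are hypothetical and chosen for
# illustration; they are not taken from the MAVEN paper.


@dataclass
class Step:
    tool: str                                              # tool to call
    make_args: Callable[[List[float]], Tuple[float, ...]]  # builds args from earlier results
    check: Callable[[float], bool]                         # intermediate verification


def run_plan(steps: List[Step], tools: Dict[str, Callable[..., float]]) -> List[float]:
    """Run steps in order; stop with an error as soon as a check fails."""
    results: List[float] = []
    for i, step in enumerate(steps):
        if step.tool not in tools:
            raise KeyError(f"step {i}: unknown tool {step.tool!r}")
        value = tools[step.tool](*step.make_args(results))
        if not step.check(value):
            raise ValueError(f"step {i}: check failed for {step.tool!r}, got {value}")
        results.append(value)
    return results


# Hand-written decomposition of 0.5 * m * v^2 with m = 2.0 and v = 3.0.
tools = {"mul": lambda a, b: a * b}
plan = [
    Step("mul", lambda r: (3.0, 3.0), lambda x: x >= 0),   # v^2
    Step("mul", lambda r: (2.0, r[0]), lambda x: x >= 0),  # m * v^2
    Step("mul", lambda r: (0.5, r[1]), lambda x: x >= 0),  # 0.5 * m * v^2
]

print(run_plan(plan, tools))  # [9.0, 18.0, 9.0]
```

Because every step is verified separately, a run can pass its early checks and still fail at the last step, which is one simple way to picture the gap between partial reasoning quality and end-to-end task success.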