arXiv cs.AI · Tuesday · May 26, 2026

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

multi-agent · llm · reinforcement-learning · workflows

A new paper on arXiv (2605.24202) investigates when end-to-end reinforcement learning (RL) training of multi-agent LLM workflows improves over base models. The study compares Shared-Policy training (all roles update one policy) with Isolated-Policy training (each role has its own parameters) across Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and model scales of 0.6B, 1.7B, and 4B parameters. Key findings: multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. The strongest patterns are explained through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts can lead to divergent gradients and instability.
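To make the two training regimes concrete, here is a minimal sketch of policy routing, meaning which parameter set receives the gradient from each role's rollouts. It is an illustration under assumptions, not the paper's implementation: the function names, the role names ("orchestrator", "worker"), and the policy ids are all hypothetical.

```python
from collections import defaultdict


def build_routing(roles, shared):
    """Map each role name to the id of the policy it would update.

    Hypothetical helper: shared=True sends every role to one policy,
    shared=False gives each role its own parameter set.
    """
    if shared:
        return {role: "policy_shared" for role in roles}
    return {role: f"policy_{role}" for role in roles}


def group_by_policy(rollouts, routing):
    """Bucket (role, sample) pairs by the policy that receives their gradient."""
    buckets = defaultdict(list)
    for role, sample in rollouts:
        buckets[routing[role]].append(sample)
    return dict(buckets)


# Illustrative roles and rollouts; two agents run the same "worker" role.
roles = ["orchestrator", "worker"]
rollouts = [
    ("orchestrator", "plan"),
    ("worker", "draft_a"),
    ("worker", "draft_b"),
]

shared = group_by_policy(rollouts, build_routing(roles, shared=True))
isolated = group_by_policy(rollouts, build_routing(roles, shared=False))

print(shared)
print(isolated)
```

In this sketch, the shared routing pools all three samples into one policy's update, while the isolated routing sends both worker samples to a worker-only policy. That second bucket is the configuration the summary points to when it mentions parallel same-role agents on shared prompts.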

// why it matters

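Policy sharing is a tradeoff, not a default. Per the paper, Isolated-Policy training tends to reach higher peak accuracy but more often falls off a terminal accuracy cliff, while Shared-Policy training changes how runs fail rather than preventing failure. Developers training multi-agent LLM workflows should weigh workflow, task, and model scale together instead of choosing a policy-sharing scheme in isolation.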

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org