When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
A new paper on arXiv (2605.24202) investigates when end-to-end reinforcement learning (RL) training of multi-agent LLM workflows improves over base models. The study compares Shared-Policy training (all roles update one policy) with Isolated-Policy training (each role has its own parameters) across three workflows (Eval-Opt, Voting, and Orch-Workers), two task types (math and code), and three model scales (0.6B, 1.7B, and 4B parameters). The key finding is that multi-agent RL usually improves over base models, but the gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy training tends to reach higher peak accuracy yet more often ends in a terminal accuracy cliff. Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. The paper explains the strongest patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts can lead to divergent gradients and instability.
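To make the idea of policy routing concrete, here is a minimal toy sketch of how per-role gradients could be routed to parameters under the two regimes. It is not the paper's implementation: the role names, the `build_routing` and `accumulate_updates` helpers, and the two-element gradient vectors are all hypothetical, chosen only to show that Shared-Policy sums every role's gradient into one update while Isolated-Policy keeps them separate.

```python
def build_routing(roles, shared):
    """Map each role to the id of the policy it updates (hypothetical helper)."""
    if shared:
        # Shared-Policy: every role updates the same parameters.
        return {role: "shared" for role in roles}
    # Isolated-Policy: each role owns its own parameters.
    return {role: role for role in roles}


def accumulate_updates(routing, role_grads):
    """Sum each role's gradient vector into the policy that role routes to."""
    updates = {}
    for role, grad in role_grads.items():
        policy_id = routing[role]
        current = updates.get(policy_id, [0.0] * len(grad))
        updates[policy_id] = [c + g for c, g in zip(current, grad)]
    return updates


# Hypothetical roles and made-up gradients, for illustration only.
roles = ["generator", "evaluator"]
role_grads = {"generator": [1.0, -2.0], "evaluator": [-1.0, 0.5]}

shared_updates = accumulate_updates(build_routing(roles, shared=True), role_grads)
isolated_updates = accumulate_updates(build_routing(roles, shared=False), role_grads)

print(shared_updates)    # one combined update for the single policy
print(isolated_updates)  # one update per role-specific policy
```

In this toy setting the shared policy receives a single update in which the two roles' gradients partly cancel, whereas each isolated policy receives only its own role's signal. Which behavior helps or hurts in practice is exactly what the paper reports as depending on workflow, task, and scale.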
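For practitioners, the choice between shared and isolated policies is a tradeoff rather than a default: Isolated-Policy favors peak accuracy at a higher risk of late collapse, while Shared-Policy changes how training fails rather than whether it fails. Either option should be evaluated for the specific workflow, task, and model scale.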