How to Stop Shipping Low-Quality RL Environments (with Examples)
Drawing on years of examining RL trajectories, the article identifies recurring problems in custom environments that actively harm model training. The main ones are reward functions that incentivize unintended shortcuts (reward hacking), physics simulations that do not match real-world dynamics, and termination conditions that end episodes prematurely and cut learning short. For instance, an environment might reward a robot for moving its arm yet fail to penalize it for exploiting a glitch that reaches the target without proper motion. Flaws like these produce models that perform well in simulation but fail in deployment. The author gives concrete examples and suggests fixes such as reward shaping, domain randomization, and thorough testing of edge cases. The piece stresses that a broken harness can leave the model worse off than no training at all, wasting significant compute and developer time.
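To make the robot-arm example concrete, here is a minimal sketch of one way an environment could refuse to pay out for a glitch. It is not taken from the article: the function name `guarded_reward`, the progress-based reward, the penalty of -1.0, and the `MAX_STEP_DISTANCE` limit are all assumptions chosen for illustration. The idea is simply that a transition which moves the arm farther than is physically plausible in one step is penalized and flagged rather than rewarded.

```python
import math

# Hypothetical limit: the farthest the end effector can legitimately
# move in a single step. The value is assumed for illustration only.
MAX_STEP_DISTANCE = 0.05  # metres


def guarded_reward(prev_pos, new_pos, target, max_step=MAX_STEP_DISTANCE):
    """Return (reward, exploited) for one transition of a reaching task.

    The reward is the reduction in distance to the target. Any step that
    moves farther than max_step is treated as a glitch: it receives a
    fixed penalty and is flagged so it can be inspected later.
    """
    step = math.dist(prev_pos, new_pos)
    if step > max_step:
        # Physically implausible jump: penalize instead of rewarding.
        return -1.0, True
    progress = math.dist(prev_pos, target) - math.dist(new_pos, target)
    return progress, False


# A normal step toward the target earns a small positive reward.
normal = guarded_reward((0.0, 0.0), (0.03, 0.0), (1.0, 0.0))

# A jump straight onto the target is flagged rather than rewarded.
glitch = guarded_reward((0.0, 0.0), (1.0, 0.0), (1.0, 0.0))
```

Returning the flag alongside the reward lets a test suite count how often trajectories trip the guard, which is one inexpensive form of the edge-case testing mentioned above. A real environment would likely need a limit derived from the actual actuator and timestep rather than a fixed constant.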
Faulty RL environments waste compute and produce unreliable models that fail in production.