Before the Model Learns the Bug: Fuzzing RLVR Verifiers
A new paper on arXiv (cs.AI) introduces a lightweight verifier-fuzzing framework for reinforcement learning with verifiable rewards (RLVR). RLVR replaces human preference labels with executable reward functions such as math answer checkers, JSON validators, and code unit-test harnesses. The framework generates adversarial completions, runs them through both a buggy verifier and a stricter reference verifier, logs the paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics. It targets the failure mode in which optimization exploits a buggy verifier, leading to reward hacking. The framework can be integrated into RL training pipelines to catch verifier bugs before the policy learns to exploit them.
Verifier bugs can be exploited by RL optimization, causing reward hacking.
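
The paired-decision loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the two verifiers, the hand-written adversarial completions, and every name here (buggy_verifier, strict_verifier, adversarial_completions, fuzz, summarize) are hypothetical and chosen only for illustration. The sketch covers false positives, false negatives, and disagreement; the exploit and uncertainty metrics are left out because the summary does not define them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A verifier maps (completion, expected answer) to accept/reject.
Verifier = Callable[[str, str], bool]


def buggy_verifier(completion: str, expected: str) -> bool:
    # Hypothetical bug: accepts if the expected answer appears anywhere.
    return expected in completion


def strict_verifier(completion: str, expected: str) -> bool:
    # Hypothetical reference: the last non-empty line must be
    # exactly "Answer: <expected>".
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == f"Answer: {expected}"


def adversarial_completions(expected: str) -> List[str]:
    # Hand-written probes (an assumption; a real fuzzer would generate these).
    return [
        f"Answer: {expected}",                   # honest, well-formed
        f"Work shown.\nAnswer: {expected}",      # honest, with reasoning
        f"Answer: {expected}0",                  # wrong, contains the answer
        f"It might be {expected}.\nAnswer: 0",   # mentions it, commits elsewhere
        f"Candidates: 1 2 3 {expected} 5",       # answer spraying
        "Answer: unknown",                       # plainly wrong
    ]


@dataclass
class PairedDecision:
    completion: str
    candidate: bool   # decision of the verifier under test
    reference: bool   # decision of the stricter reference verifier


def fuzz(expected: str, candidate: Verifier, reference: Verifier) -> List[PairedDecision]:
    # Run every probe through both verifiers and log the paired decisions.
    return [
        PairedDecision(c, candidate(c, expected), reference(c, expected))
        for c in adversarial_completions(expected)
    ]


def summarize(log: List[PairedDecision]) -> Dict[str, float]:
    # Counts are relative to the reference verifier, treated as ground truth.
    total = len(log)
    fp = sum(d.candidate and not d.reference for d in log)
    fn = sum(d.reference and not d.candidate for d in log)
    rate = (fp + fn) / total if total else 0.0
    return {
        "total": total,
        "false_positives": fp,
        "false_negatives": fn,
        "disagreement_rate": rate,
    }


if __name__ == "__main__":
    print(summarize(fuzz("42", buggy_verifier, strict_verifier)))
```

In this toy setup, every disagreement is a completion the substring check accepts and the reference rejects, which is the kind of gap an RL policy could learn to exploit. Treating the stricter verifier as ground truth is itself an assumption; a reference that is too strict would show up as spurious false positives.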