arXiv cs.AI · Saturday · May 23, 2026 · FREE

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

dpo · rlhf · alignment · preference-optimization

A paper on arXiv (2605.20834) argues that the theoretical equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) is conditional, not universal. The equivalence rests on an implicit assumption: that the RLHF-optimal policy prefers the human-preferred response. When this assumption fails, which the authors argue happens frequently in practice, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences. The result can be pathological convergence, in which a policy lowers its DPO loss while still preferring the dispreferred response. The paper characterizes when the assumption is violated, demonstrates that a space of undesirable solutions exists, and proves that in such cases DPO and RLHF optimize fundamentally different objectives. As a remedy, the authors propose Constrained Preference Optimization (CPO), which augments RLHF with constraints to obtain provable alignment. They also give a geometric interpretation in terms of soft margin ranking, showing that DPO implements margin ranking with targets that can be negative. The analysis establishes conditions under which alignment holds and offers guidance for practitioners.
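The failure mode can be illustrated with a toy calculation. The sketch below uses the standard per-pair DPO loss, not code from the paper, and every probability in it is a hypothetical number chosen for illustration. It assumes a reference policy that strongly favors the dispreferred response, then evaluates a policy that has improved relative to the reference but still ranks the dispreferred response higher. The `target` variable is one reading of the "negative target" idea: the policy's own log-ratio only has to exceed the reference's log-ratio, which is negative here.

```python
import math


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=1.0):
    """Standard per-pair DPO loss, -log(sigmoid(beta * margin)).

    pi_w / pi_l: policy probability of the preferred / dispreferred response.
    ref_w / ref_l: reference-policy probability of the same two responses.
    Returns (loss, margin).
    """
    margin = (math.log(pi_w) - math.log(ref_w)) - (math.log(pi_l) - math.log(ref_l))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin)), margin


# Hypothetical reference policy that favors the dispreferred response.
ref_w, ref_l = 0.1, 0.9

# Starting point: the policy equals the reference, so the margin is 0.
loss_ref, margin_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Hypothetical trained policy: better than the reference in relative terms,
# yet it still assigns more probability to the dispreferred response.
pi_w, pi_l = 0.3, 0.7
loss_pi, margin_pi = dpo_loss(pi_w, pi_l, ref_w, ref_l)

# The policy's log-ratio only needs to beat this (negative) reference log-ratio.
target = math.log(ref_w / ref_l)
policy_log_ratio = math.log(pi_w / pi_l)
prefers_dispreferred = pi_l > pi_w

print(f"loss at reference: {loss_ref:.3f}")      # about 0.693 (log 2)
print(f"loss after update: {loss_pi:.3f}")       # about 0.231
print(f"target log-ratio:  {target:.3f}")        # negative
print(f"policy log-ratio:  {policy_log_ratio:.3f}")
print(f"still prefers dispreferred response: {prefers_dispreferred}")
```

In this toy setting the loss drops from about 0.693 to about 0.231 even though the policy still gives the dispreferred response 70% of the probability mass, which is consistent with the relative-versus-absolute distinction described above.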

// why it matters

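A falling DPO loss does not by itself show that a model prefers the responses humans prefer. DPO-trained policies need to be validated directly against the preference data, or trained with an alternative such as the proposed CPO.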

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
