arXiv cs.AI · Thursday · May 28, 2026 · FREE

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

multimodal · reasoning · hallucination · dpo · ai

A new arXiv paper introduces Reasoning-Conditioned Direct Preference Optimization (RC-DPO) to mitigate hallucinations in multimodal large reasoning models. The authors observe that standard response-level DPO treats the chain-of-thought (CoT) and the final answer as one monolithic output, so in effect it learns only answer-level preferences. RC-DPO explicitly formulates a CoT-oriented preference term: it models the CoT as a condition for answer generation and contrasts preferences for the same preferred answer under different CoT conditions, which promotes reasoning chains that support correct answers. To generate effective training data, the paper uses Monte Carlo Tree Search to discover visually grounded reasoning paths. Evaluated on multimodal reasoning benchmarks, the method shows lower hallucination rates and better reasoning quality than baseline DPO approaches. The work targets a critical gap in training-based hallucination mitigation for reasoning models.
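To make the idea concrete, here is a minimal Python sketch of how a response-level DPO term and a CoT-conditioned term might be combined. It is one reading of the summary above, not the paper's actual objective: the function names, the weight `lam`, and the choice to simply add the two terms are all assumptions made for illustration. Inputs are summed log-probabilities that a real training loop would obtain from the policy and a frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_term(policy_w: float, ref_w: float,
             policy_l: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    policy_* / ref_* are log-probabilities of the preferred (w) and
    dispreferred (l) items under the policy and the reference model.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)


def rc_dpo_sketch(response_pair, cot_conditioned_pair,
                  beta: float = 0.1, lam: float = 1.0) -> float:
    """Hypothetical combination of two preference terms.

    response_pair: log-probs of the full outputs (CoT + answer),
        preferred vs. dispreferred, as in response-level DPO.
    cot_conditioned_pair: log-probs of the SAME preferred answer,
        conditioned on the preferred CoT (w) vs. a dispreferred CoT (l).
    lam: assumed weight on the CoT-conditioned term (not from the paper).
    """
    response_loss = dpo_term(*response_pair, beta=beta)
    cot_loss = dpo_term(*cot_conditioned_pair, beta=beta)
    return response_loss + lam * cot_loss


# Toy log-probs, invented purely for illustration:
# (policy_w, ref_w, policy_l, ref_l)
loss = rc_dpo_sketch(
    response_pair=(-12.0, -13.0, -15.0, -14.0),
    cot_conditioned_pair=(-2.0, -2.5, -4.0, -3.0),
)
print(f"sketch loss: {loss:.4f}")
```

The second term only has signal when the same answer is scored under two different reasoning chains, which is why the summary's emphasis on finding contrasting, visually grounded CoT paths matters for building the training pairs.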

// why it matters

Developers building multimodal reasoning systems can reduce hallucinations by explicitly optimizing chain-of-thought reasoning.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
