Joint Multimodal Preference Optimization for Fine-Grained Visual-Textual Alignment
Abstract
AbstractRecent research has focused on addressing multimodal hallucinations in Large Vision-Language Models (LVLMs) by extending Direct Preference Optimization (DPO) to incorporate visual preference supervision. However, these methods often lack fine-grained visual contrast mechanisms and rely on single-margin optimization. This in turn limits their ability to capture precise visual semantics and results in weak multimodal alignment. To address these issues, we propose Joint Multimodal Preference Optimization (JoMPO), a novel optimization framework that symmetrically integrates a text-conditioned preference loss with a visual ranking-based objective. JoMPO leverages semantically contrastive image–text pairs and listwise ranking over multiple visual contexts, enabling fine-grained visual grounding and more robust cross-modal alignment. To support this framework, we introduce the Visual–Textual Contrast (VTC) dataset, consisting of image pairs that are semantically similar but visually distinct, each paired with a contextually grounded textual response. When trained with only 5k contrastive pairs, JoMPO consistently demonstrates superior performance across diverse benchmarks, highlighting its effectiveness in mitigating hallucinations and improving image-text alignment in LVLMs.