Rethinking DPO-style Diffusion Aligning Frameworks

Xun Wu; Shaohan Huang; Lingjie Jiang; Furu Wei

2025 ICCV ICCV 2025

Rethinking DPO-style Diffusion Aligning Frameworks

Abstract

Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. However, We identify two potential risks for existing DPO algorithms: First, current DPO methods for estimating the rewards of step-wise intermediate samples are biased, leading to inaccurate preference ordering for step-wise optimization. Second, existing DPO methods may inadvertently increase the sampling probabilities of dispreferred samples, potentially introducing application risks. To address these issues, we propose Revised Direct Preference Optimization (RDPO), a simple but effective step-wise DPO-based text-to-image diffusion model alignment method. By designing a more theoretically grounded and efficient intermediate-step reward estimation and introducing an additional regularization terms to constrain the sampling probability of dispreferred samples, RDPO can achieve more effective and stable text-to-image alignment performance. Our experiments on two datasets, with base models including Stable Diffusion v1.5 and SDXL, demonstrate that RDPO can effectively learn and construct reward signals for each step of the model, improving alignment performance while ensuring better generalization.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xun Wu , Shaohan Huang , Lingjie Jiang , Furu Wei

Topics

Artificial Intelligence > Core AI > Foundation Models Machine Learning > Optimization & Theory > Optimization Deep Learning > Models > Diffusion Models Machine Learning > Learning Types > Reinforcement Learning Machine Learning > Learning Types > Preference Learning

Keywords

direct preference optimization text-to-image generation model alignment diffusion model human preference reward estimation

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025