sDPO: Don’t Use Your Data All at Once

Dahyun Kim; Yungi Kim; Wonho Song; Hyeonwoo Kim; Yunsu Kim; Sanghoon Kim; Chanjun Park

COLING 2025

Abstract

As large language models (LLMs) continue to advance, aligning them with human preferences has become a critical objective. In this paper, we introduce stepwise DPO (sDPO), an innovative extension of the recently popularized Direct Preference Optimization (DPO) technique for alignment tuning. sDPO systematically partitions the available preference datasets and applies them incrementally, rather than utilizing the entire dataset simultaneously. This stepwise manner enables the integration of progressively more aligned reference models within the DPO training framework. Our empirical results demonstrate that sDPO not only enhances the alignment precision of reference models but also significantly improves the overall performance of the final model, surpassing other prominent LLMs with larger parameter counts.
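The stepwise procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `partition`, `sdpo`, and `dpo_train` are hypothetical, the equal-size split is an assumption (the abstract does not say how the preference data is partitioned), and `dpo_train` stands in for any routine that runs one round of DPO against a fixed reference model and returns the aligned result. `dpo_loss` is the standard per-pair DPO objective, included only to show where the reference model's log-probabilities enter.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
# One preference example: (prompt, chosen response, rejected response).
Preference = Tuple[str, str, str]


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probs."""
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def partition(data: Sequence[Preference], num_steps: int) -> List[List[Preference]]:
    """Split data into num_steps contiguous, near-equal chunks (an assumption)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    base, extra = divmod(len(data), num_steps)
    chunks, start = [], 0
    for i in range(num_steps):
        size = base + (1 if i < extra else 0)
        chunks.append(list(data[start:start + size]))
        start += size
    return chunks


def sdpo(initial_model: Model,
         data: Sequence[Preference],
         num_steps: int,
         dpo_train: Callable[[Model, List[Preference]], Model]) -> Model:
    """Run DPO once per chunk, reusing each step's output as the next reference."""
    reference = initial_model
    for chunk in partition(data, num_steps):
        # Hypothetical trainer: aligns a model against `reference` on `chunk`.
        reference = dpo_train(reference, chunk)
    return reference
```

With `num_steps=1` the loop reduces to a single DPO run over the whole dataset, which is the baseline the stepwise variant is contrasted with.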

Topics

Natural Language Processing > Resources & Methods > Large Language Models; Machine Learning > Learning Types > Reinforcement Learning; Artificial Intelligence > Core AI > Large Language Models; Deep Learning > Learning Types > Deep Learning

Keywords

direct preference optimization; alignment tuning; preference dataset; stepwise training; large language model; reference model
