WPO: Enhancing RLHF with Weighted Preference Optimization

Wenxuan Zhou; Ravi Agrawal; Shujian Zhang; Sathish Reddy Indurthi; Sanqiang Zhao; Kaiqiang Song; Silei Xu; Chenguang Zhu

EMNLP 2024

Abstract

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks, including Alpaca Eval 2 and MT-bench. WPO outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 and, based on Gemma-2-9b-it, achieves a length-controlled winning rate of 76.7% against GPT-4-turbo. We release the code and models at https://github.com/wzhouad/WPO.
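The reweighting idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact weight or loss, so everything below is an assumption made for illustration: the per-pair loss is the standard DPO loss, the weight is taken to be the product of the length-normalized sequence probabilities of the two responses under the current policy, and the names `dpo_pair_loss`, `pair_weight`, and `weighted_preference_loss` are hypothetical. For the authors' actual formulation, see the released code.

```python
import math


def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities.

    "_w" is the preferred response, "_l" the dispreferred one.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pair_weight(policy_logp_w, policy_logp_l, len_w, len_l):
    """ASSUMED weight: product of length-normalized sequence probabilities
    of both responses under the current policy. Lies in (0, 1]."""
    return math.exp(policy_logp_w / len_w + policy_logp_l / len_l)


def weighted_preference_loss(batch, beta=0.1):
    """Average of per-pair DPO losses, each scaled by its policy-based weight.

    In an autodiff framework the weight would typically be treated as a
    constant (no gradient flows through it); that is also an assumption here.
    """
    total = 0.0
    for ex in batch:
        w = pair_weight(ex["policy_logp_w"], ex["policy_logp_l"],
                        ex["len_w"], ex["len_l"])
        loss = dpo_pair_loss(ex["policy_logp_w"], ex["policy_logp_l"],
                             ex["ref_logp_w"], ex["ref_logp_l"], beta)
        total += w * loss
    return total / len(batch)


# Two toy pairs: the first is likely under the current policy, the second is not.
batch = [
    {"policy_logp_w": -10.0, "policy_logp_l": -12.0,
     "ref_logp_w": -11.0, "ref_logp_l": -11.0, "len_w": 10, "len_l": 10},
    {"policy_logp_w": -40.0, "policy_logp_l": -45.0,
     "ref_logp_w": -41.0, "ref_logp_l": -44.0, "len_w": 10, "len_l": 10},
]
print(weighted_preference_loss(batch))
```

Under these assumptions, pairs that the current policy is likely to generate contribute more to the loss than unlikely ones, which is one way off-policy data can be made to behave more like on-policy data without collecting new samples.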

Topics

Artificial Intelligence > Core AI > Foundation Models; Reinforcement Learning > Methods > Deep RL; Reinforcement Learning > Methods > Offline RL; Artificial Intelligence > Core AI > Large Language Models; Machine Learning > Learning Types > Reinforcement Learning from Human Feedback

Keywords

direct preference optimization; preference optimization; instruction following; reinforcement learning from human feedback; off-policy learning; LLM alignment; large language model
