Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

Lu Chen; Rui Zheng; Binghai Wang; Senjie Jin; Caishuang Huang; Junjie Ye; Zhihao Zhang; Yuhao Zhou; Zhiheng Xi; Tao Gui; Qi Zhang; Xuanjing Huang

EMNLP 2024

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noise and ambiguous data, often failing to deeply understand human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to the supervised ranking loss, we introduce an unsupervised contrastive loss to enable the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
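
The combination described in the abstract, a supervised ranking loss plus an unsupervised contrastive term, can be illustrated with a minimal sketch. The abstract does not specify the form of either loss, so everything below is an assumption made for illustration: the ranking loss is taken to be the common pairwise form -log sigmoid(r_chosen - r_rejected), the contrastive term is taken to be an InfoNCE loss over two embedding views of the same inputs, and the names `ranking_loss`, `info_nce_loss`, `total_loss`, `temperature`, and `beta` are hypothetical rather than taken from the paper.

```python
import math


def ranking_loss(r_chosen, r_rejected):
    """Mean pairwise ranking loss, -log sigmoid(r_chosen - r_rejected).

    Assumed form; the abstract only says "ranking loss".
    """
    losses = []
    for c, r in zip(r_chosen, r_rejected):
        x = c - r
        # Numerically stable softplus(-x) == -log(sigmoid(x)).
        losses.append(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(view_a, view_b, temperature=0.05):
    """InfoNCE over two views: row i of view_a should match row i of view_b.

    Assumed stand-in for the "unsupervised contrastive loss"; the other
    rows of view_b act as in-batch negatives.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


def total_loss(r_chosen, r_rejected, view_a, view_b, beta=1.0):
    """Ranking loss plus a beta-weighted contrastive term (beta is hypothetical)."""
    return ranking_loss(r_chosen, r_rejected) + beta * info_nce_loss(view_a, view_b)
```

In this sketch the two views would come from the reward model's own representations of the same preference samples (for example, two forward passes with different dropout masks), so the contrastive term needs no labels beyond those already used by the ranking loss.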

Topics

Machine Learning > Learning Types > Contrastive Learning; Machine Learning > Optimization & Theory > Loss Functions; Reinforcement Learning > Methods > Policy Learning; Deep Learning > Learning Types > Contrastive Learning; Deep Learning > Learning Types > Reinforcement Learning; Machine Learning > Learning Types > Reinforcement Learning from Human Feedback

Keywords

contrastive learning; reinforcement learning; reward modeling; preference alignment; language model alignment; reinforcement learning from human feedback; preference modeling; reward model
