The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models

Yanjun Chen; Dawei Zhu; Yirong Sun; Xinghao Chen; Wei Zhang; Xiaoyu Shen

EMNLP 2024


Abstract

Reinforcement Learning from Human Feedback (RLHF) significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of the reward models used during training. This study explores whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and Longformer-based reward models, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This finding challenges the widely held belief that stronger reward models always lead to better language models, and it opens new avenues for research into the key factors driving model performance and into how to choose the most suitable reward model.



Topics

Machine Learning > Optimization & Theory > Theory; Machine Learning > Application Areas > Fairness; Reinforcement Learning > Methods > Policy Learning; Deep Learning > Learning Types > Reinforcement Learning; Artificial Intelligence > Core AI > Reinforcement Learning; Machine Learning > Learning Types > Reinforcement Learning from Human Feedback

Keywords

natural language processing; language model alignment; reinforcement learning from human feedback; model alignment; language model; reward model; accuracy paradox

