Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Chenlu Ye; Wei Xiong; Yuheng Zhang; Hanze Dong; Nan Jiang; Tong Zhang

NeurIPS 2024

Abstract

We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function or a preference signal drawn from the Bradley-Terry model, as most prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and propose sample-efficient algorithms for both offline learning from a pre-collected preference dataset and online learning, where we can query the preference oracle during training. Empirical studies verify the effectiveness of the proposed framework.
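To make the abstract's two central claims concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm. It assumes a hypothetical three-response preference table `P` with a cyclic preference, and it assumes the regularized game value takes the form win rate minus (1/eta) KL(pi1 || pi0) plus (1/eta) KL(pi2 || pi0); the abstract only states that the game is reverse-KL regularized, so this exact form, the coefficient `eta`, and all function names are illustrative assumptions.

```python
import math
from itertools import permutations

# Hypothetical preference oracle over 3 responses (an assumption for illustration):
# P[i][j] is the probability that response i is preferred over response j.
# The preferences form a cycle: 0 beats 1, 1 beats 2, 2 beats 0.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def admits_reward_ranking(pref):
    """Return True if some strict ranking agrees with every pairwise majority.

    Under a Bradley-Terry model, i is preferred to j with probability above 0.5
    exactly when reward(i) > reward(j), so such a ranking must exist.
    """
    n = len(pref)
    for order in permutations(range(n)):
        position = {item: rank for rank, item in enumerate(order)}
        if all(
            position[i] < position[j]
            for i in range(n)
            for j in range(n)
            if pref[i][j] > 0.5
        ):
            return True
    return False


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def win_rate(pi1, pi2):
    """Probability that a response drawn from pi1 beats one drawn from pi2."""
    n = len(P)
    return sum(pi1[i] * P[i][j] * pi2[j] for i in range(n) for j in range(n))


def objective(pi1, pi2, pi0, eta):
    """Assumed KL-regularized game value: pi1 maximizes it, pi2 minimizes it."""
    return win_rate(pi1, pi2) - kl(pi1, pi0) / eta + kl(pi2, pi0) / eta


uniform = [1 / 3, 1 / 3, 1 / 3]
skewed = [0.6, 0.3, 0.1]

print(admits_reward_ranking(P))                     # no ranking fits the cycle
print(objective(uniform, uniform, uniform, 1.0))    # game value at the uniform policy
print(objective(uniform, skewed, uniform, 1.0))     # min-player deviation
print(objective(skewed, uniform, uniform, 1.0))     # max-player deviation
```

In this toy setting the cyclic table cannot be explained by any reward ranking, which illustrates why a general preference oracle covers cases the Bradley-Terry model cannot. With a uniform reference policy, the uniform policy wins half the time against any opponent, and either player deviating from it only pays a KL penalty, so neither deviation improves that player's side of the game.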

Topics

Artificial Intelligence > Core AI > Agent Systems
Natural Language Processing > Resources & Methods > Large Language Models
Reinforcement Learning > Methods > Deep RL
Machine Learning > Learning Types > Reinforcement Learning
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Learning Types > Fine-Tuning
Machine Learning > Learning Types > Large Language Models
Machine Learning > Learning Types > Preference Learning

Keywords

reinforcement learning, preference learning, policy learning, minimax optimization, reinforcement learning from human feedback, reward function, human feedback, language model, large language model
