Aligning Large Language Models via Fine-grained Supervision

Dehong Xu; Liang Qiu; Minseok Kim; Faisal Ladhak; Jaeyoung Do

ACL 2024

Abstract

Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning from human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences over LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experimental results demonstrate that this approach can improve LLM performance by up to 5.1%, measured as win rate against the reference model, compared with the traditional PPO model.
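To make the idea of token-level supervision from minimal edits concrete, here is a minimal sketch of one way such labels could be derived. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how edits are aligned or labeled, so the whitespace tokenization, the use of Python's `difflib.SequenceMatcher`, the hypothetical function name `token_labels`, and the label scheme (-1 for tokens the annotator removed or replaced, +1 for tokens the annotator introduced, 0 for unchanged tokens) are all choices made here for illustration.

```python
from difflib import SequenceMatcher


def token_labels(original: str, edited: str):
    """Label tokens by diffing a response against its minimally edited version.

    Assumed scheme (not taken from the paper): -1 for original tokens that
    were removed or replaced, +1 for tokens introduced by the edit, and 0
    for tokens that appear unchanged in both versions.
    """
    orig_tokens = original.split()  # whitespace tokens, for illustration only
    edit_tokens = edited.split()
    orig_labels = [0] * len(orig_tokens)
    edit_labels = [0] * len(edit_tokens)

    matcher = SequenceMatcher(a=orig_tokens, b=edit_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span: keep the neutral label
        for i in range(i1, i2):  # tokens dropped from the original
            orig_labels[i] = -1
        for j in range(j1, j2):  # tokens added by the annotator
            edit_labels[j] = 1

    return list(zip(orig_tokens, orig_labels)), list(zip(edit_tokens, edit_labels))


if __name__ == "__main__":
    rejected = "You should just ignore the stupid question"
    revised = "You should politely decline the question"
    orig, edit = token_labels(rejected, revised)
    print(orig)
    print(edit)
```

In this toy pair, only "just", "ignore", and "stupid" receive -1 and only "politely" and "decline" receive +1; every other token stays at 0. That is the kind of localized signal a sequence-level preference label cannot provide. A real system would operate on the model's own subword tokens rather than whitespace splits.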

Topics

Artificial Intelligence > Core AI > Foundation Models; Reinforcement Learning > Methods > Deep RL; Artificial Intelligence > Core AI > Large Language Models; Machine Learning > Learning Types > Fine-Tuning; Deep Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning; reward modeling; language model alignment; reinforcement learning from human feedback; model alignment; proximal policy optimization; large language model; token-level supervision
