Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Katherine Metcalf; Miguel Sarabia; Natalie Mackraz; Barry-John Theobald

2023 CORL CoRL 2023

Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Abstract

Preference-based reinforcement learning (PbRL) aligns a robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover $83%$ and $66%$ of ground truth reward policy performance versus only $38%$ and $21%$ without environment dynamics. The performance gains demonstrate that explicitly encoding environment dynamics improves preference-learned reward functions.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Reinforcement Learning

📈 Trend Setter — Preference Learning

🧭 Keyword Pioneer — dynamics encoding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Katherine Metcalf , Miguel Sarabia , Natalie Mackraz , Barry-John Theobald

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning Reinforcement Learning > Methods > Deep RL Machine Learning > Learning Types > Preference Learning

Keywords

sample efficiency self-supervised learning reward function preference-based reinforcement learning dynamics encoding

Download PDF

Related papers

Stochastic Occupancy Grid Map Prediction in Dynamic Scenes 2023

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning 2023

Robot Parkour Learning 2023

Task-Oriented Koopman-Based Control with Contrastive Encoder 2023

Language-Guided Traffic Simulation via Scene-Level Diffusion 2023