An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Ehsan Imani; Eric Graves; Martha White

2018 NIPS NeurIPS 2018

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Abstract

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Reinforcement Learning

🧭 Keyword Pioneer — emphatic weighting

🐣 Hot Topic Early Bird — off-policy learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ehsan Imani , Eric Graves , Martha White

Topics

Reinforcement Learning > Methods > Deep RL Reinforcement Learning > Methods > Policy Learning Machine Learning > Learning Types > Reinforcement Learning Artificial Intelligence > Core AI > Reinforcement Learning

Keywords

reinforcement learning policy gradient off-policy learning convergence guarantee continuous action emphatic weighting

Download PDF

Related papers

Maximum Causal Tsallis Entropy Imitation Learning 2018

Recurrent World Models Facilitate Policy Evolution 2018

Bandit Learning in Concave N-Person Games 2018

Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation 2018

PAC-Bayes bounds for stable algorithms with instance-dependent priors 2018