On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal; Sham M. Kakade; Jason D. Lee; Gaurav Mahajan

JMLR 2021

Abstract

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case, which avoid explicit worst-case dependencies on the size of state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
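To make the tabular setting in the abstract concrete, the following is a minimal sketch of exact policy gradient ascent on a toy discounted MDP. Everything specific here is an assumption made for illustration and is not taken from the paper: the softmax parameterization of the tabular policy, the two-state MDP built by the hypothetical helper `two_state_mdp`, the uniform start distribution `mu`, the step size `eta`, and the use of exact (not sampled) gradients. It only shows the value under the start distribution climbing toward the optimal value on this one example.

```python
import numpy as np


def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] is the probability of action a in state s."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def evaluate(pi, P, r, gamma, mu):
    """Exact policy evaluation for a small tabular MDP.

    P has shape (S, A, S), r has shape (S, A), mu is the start distribution.
    Returns V (S,), Q (S, A) and the discounted state visitation d (S,).
    """
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                # expected one-step reward under pi
    M = np.linalg.inv(np.eye(S) - gamma * P_pi)
    V = M @ r_pi
    Q = r + gamma * (P @ V)
    d = (1.0 - gamma) * (mu @ M)
    return V, Q, d


def run_policy_gradient(P, r, gamma, mu, steps=2000, eta=1.0):
    """Gradient ascent on V(mu) with a softmax tabular policy (assumed setup)."""
    theta = np.zeros_like(r)                   # uniform initial policy
    values = []
    for _ in range(steps):
        pi = softmax_policy(theta)
        V, Q, d = evaluate(pi, P, r, gamma, mu)
        values.append(mu @ V)
        advantage = Q - V[:, None]
        # Softmax policy gradient: d(s) * pi(a|s) * A(s, a) / (1 - gamma).
        grad = d[:, None] * pi * advantage / (1.0 - gamma)
        theta = theta + eta * grad
    V, _, _ = evaluate(softmax_policy(theta), P, r, gamma, mu)
    values.append(mu @ V)
    return theta, np.array(values)


def optimal_value(P, r, gamma, iters=1000):
    """Value iteration, used only as a reference point for the sketch."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = (r + gamma * (P @ V)).max(axis=1)
    return V


def two_state_mdp():
    """Hypothetical toy MDP: action 0 stays, action 1 switches state.

    Reward 1 is paid only for staying in state 1.
    """
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0
    P[0, 1, 1] = 1.0
    P[1, 0, 1] = 1.0
    P[1, 1, 0] = 1.0
    r = np.array([[0.0, 0.0], [1.0, 0.0]])
    gamma = 0.9
    mu = np.array([0.5, 0.5])
    return P, r, gamma, mu


if __name__ == "__main__":
    P, r, gamma, mu = two_state_mdp()
    theta, values = run_policy_gradient(P, r, gamma, mu)
    v_star = mu @ optimal_value(P, r, gamma)
    print("initial value:", values[0])
    print("final value:", values[-1])
    print("optimal value:", v_star)
```

The parametric case discussed in the abstract (log-linear or neural policies that may not contain the optimal policy) is not covered by this sketch; there the policy class itself introduces approximation error that no amount of optimization removes.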

Topics

Machine Learning > Optimization & Theory > Learning Theory

Machine Learning > Optimization & Theory > Theory

Reinforcement Learning > Methods > Policy Learning

Keywords

policy gradient, global convergence, distribution shift, approximation error, parametric policy

Related papers

Optimal Feedback Law Recovery by Gradient-Augmented Sparse Polynomial Regression 2021

Normalizing Flows for Probabilistic Modeling and Inference 2021

Determining the Number of Communities in Degree-corrected Stochastic Block Models 2021

Guided Visual Exploration of Relations in Data Sets 2021

Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach 2021