Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation

Tingting Zhao; Gang Niu; Ning Xie; Jucheng Yang; Masashi Sugiyama

ACML 2015

Abstract

Policy gradient algorithms, which update the policy parameters along the steepest direction of the expected return, are widely used in reinforcement learning problems with continuous action spaces. However, the large variance of policy gradient estimates often makes policy updates unstable. In this paper, we propose to suppress this variance by directly employing the variance of the policy gradients as a regularizer. Through experiments, we demonstrate that the proposed variance-regularization technique, combined with parameter-based exploration and baseline subtraction, provides more reliable policy updates than its non-regularized counterparts.
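The abstract does not give the estimator or the exact form of the regularizer, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a Gaussian distribution over policy parameters (as in parameter-based exploration), a scalar baseline, and a penalty equal to the trace of the sample covariance of the per-sample gradient terms. All function names and the weight `lam` are hypothetical.

```python
import numpy as np


def pgpe_gradient_samples(thetas, returns, eta, tau, baseline=0.0):
    """Per-sample gradient terms with respect to the mean `eta`.

    Assumption: policy parameters are drawn from N(eta, tau^2 I), whose
    score function with respect to eta is (theta - eta) / tau^2.
    """
    scores = (thetas - eta) / tau**2
    return (returns - baseline)[:, None] * scores


def gradient_and_variance(samples):
    """Monte Carlo gradient estimate and a scalar summary of its spread."""
    grad = samples.mean(axis=0)
    # Trace of the sample covariance of the per-sample gradient terms.
    variance = samples.var(axis=0, ddof=1).sum()
    return grad, variance


def regularized_objective(returns, samples, lam):
    """Hypothetical objective: mean return minus lam times gradient variance."""
    _, variance = gradient_and_variance(samples)
    return returns.mean() - lam * variance


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eta, tau = np.zeros(3), 1.0
    thetas = eta + tau * rng.standard_normal((500, 3))
    # Toy return with a large constant offset, chosen only for illustration.
    returns = 100.0 - (thetas**2).sum(axis=1)

    plain = pgpe_gradient_samples(thetas, returns, eta, tau)
    based = pgpe_gradient_samples(thetas, returns, eta, tau,
                                  baseline=returns.mean())
    print("variance without baseline:", gradient_and_variance(plain)[1])
    print("variance with baseline:   ", gradient_and_variance(based)[1])
    print("regularized objective:    ", regularized_objective(returns, based, 0.1))
```

In this toy setting the baseline removes the constant offset in the returns, so the variance term is much smaller with it than without it. Actually optimizing such an objective would also require the gradient of the variance term with respect to the distribution parameters, which this sketch does not derive.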

Topics

Machine Learning > Optimization & Theory > Optimization
Reinforcement Learning > Methods > Deep RL
Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning, policy gradient, variance reduction, continuous action space, expected return
