How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Pierluca D'Oro; Wojciech Jaśkowski

2020 NIPS NeurIPS 2020

How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Abstract

Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic with respect to input actions. However, instead of gradients, the critic is, typically, only trained to accurately predict expected returns, which, on their own, are useless for policy optimization. In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement. On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.

❓ The Questioner

🌉 Interdisciplinary Bridge — Deep Learning and Reinforcement Learning

🧭 Keyword Pioneer — action-value gradient

🐣 Hot Topic Early Bird — temporal difference learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Pierluca D'Oro , Wojciech Jaśkowski

Topics

Reinforcement Learning > Methods > Deep RL Reinforcement Learning > Methods > Policy Learning Deep Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning temporal difference learning policy optimization policy gradient model-based reinforcement learning model-based rl action-value gradient

Download PDF

Related papers

Higher-Order Spectral Clustering of Directed Graphs 2020

Self-Supervised MultiModal Versatile Networks 2020

Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates 2020

Causal Intervention for Weakly-Supervised Semantic Segmentation 2020

Taming Discrete Integration via the Boon of Dimensionality 2020