Dynamic Policy Programming with Function Approximation

Mohammad Gheshlaghi Azar; Vicenç Gómez; Bert Kappen

AISTATS 2011

Abstract

In this paper, we consider the problem of planning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative method, called dynamic policy programming (DPP), which updates the parametrized policy by a Bellman-like iteration. For the discrete state-action case, we establish sup-norm loss bounds for the performance of the policy induced by DPP and prove that it asymptotically converges to the optimal policy. We then generalize our approach to large-scale (continuous) state-action problems using function approximation. We provide sup-norm performance-loss bounds for approximate DPP and compare them with the standard results from approximate dynamic programming (ADP), showing that approximate DPP yields a tighter asymptotic bound than standard ADP methods. We also numerically compare the performance of DPP with other ADP and reinforcement learning (RL) methods and observe that approximate DPP asymptotically outperforms the other methods on the mountain-car problem.

Authors

Mohammad Gheshlaghi Azar, Vicenç Gómez, Bert Kappen

Topics

Machine Learning > Optimization & Theory > Optimization
Machine Learning > Learning Types > Reinforcement Learning
Machine Learning > Learning Types > Multi-Armed Bandits

Keywords

reinforcement learning, policy optimization, function approximation, Markov decision process, dynamic programming, policy iteration
