On the Convergence of Optimistic Policy Iteration

John N. Tsitsiklis

JMLR, 2002

Abstract

We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q-values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estimation. We also indicate some extensions that either fail or are unlikely to go through.
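As a rough illustration of the kind of scheme the abstract describes, the sketch below alternates a single Monte Carlo update of each Q-value with an immediate switch to the greedy policy, rather than evaluating each policy to completion. It is a minimal sketch under stated assumptions, not the paper's exact algorithm: the two-state toy MDP, the discount factor, the 1/t step size, the rollout horizon, and all function names are hypothetical choices made for this example.

```python
import random

GAMMA = 0.9     # assumed discount factor
TERMINAL = 2    # assumed absorbing state of the toy problem


def step(state, action, rng):
    """Hypothetical toy dynamics: returns (reward, next_state)."""
    if state == 0:
        return (0.0, 1) if action == 0 else (1.0, TERMINAL)
    # state == 1
    if action == 0:
        return 0.0, TERMINAL
    # Action 1 pays 10 with probability 0.8, otherwise stays in state 1.
    return (10.0, TERMINAL) if rng.random() < 0.8 else (0.0, 1)


def greedy(q):
    """Greedy policy with respect to q; ties go to the lower action index."""
    return [max(range(2), key=lambda a: q[s][a]) for s in range(2)]


def rollout(state, action, policy, rng, horizon=200):
    """Discounted return of one simulated trajectory that starts with
    (state, action) and follows `policy` afterwards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        reward, state = step(state, action, rng)
        total += discount * reward
        if state == TERMINAL:
            break
        discount *= GAMMA
        action = policy[state]
    return total


def optimistic_policy_iteration(iterations=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    for t in range(1, iterations + 1):
        policy = greedy(q)       # greedy policy selection
        step_size = 1.0 / t      # assumed diminishing step size
        for s in range(2):
            for a in range(2):
                # One Monte Carlo sample per (state, action) pair,
                # then a small move of the estimate toward it.
                g = rollout(s, a, policy, rng)
                q[s][a] += step_size * (g - q[s][a])
    return q, greedy(q)


if __name__ == "__main__":
    q, policy = optimistic_policy_iteration()
    print("Q estimates:", q)
    print("greedy policy:", policy)
```

In this toy problem the optimal policy takes action 0 in state 0 and action 1 in state 1. The policy is "optimistic" in the sense that it is recomputed after every incremental update, while the Q estimates are still far from the true values of the policy that generated the samples.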

Authors

John N. Tsitsiklis

Topics

Reinforcement Learning > Methods > Policy Learning

Reinforcement Learning > Applications > Value Iteration

Keywords

temporal difference learning, Markov decision process, Monte Carlo estimation, policy iteration, greedy policy

Related papers

Kernel Independent Component Analysis 2002

Memory-Based Shallow Parsing 2002

Covering Number Bounds of Certain Regularized Linear Function Classes 2002

The Subspace Information Criterion for Infinite Dimensional Hypothesis Spaces 2002

The Set Covering Machine 2002