On Generalized Bellman Equations and Temporal-Difference Learning

Huizhen Yu; A. Rupam Mahmood; Richard S. Sutton

JMLR 2018

Abstract

We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the $\lambda$-parameters of TD, based on generalized Bellman equations. Our scheme is to set $\lambda$ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger $\lambda$ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations. © JMLR 2018.
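The idea of choosing $\lambda$ from the trace iterates can be illustrated with a small sketch. It assumes the standard off-policy trace recursion $e_t = \lambda_t \gamma \rho_{t-1} e_{t-1} + \phi(S_t)$ and one simple scaling rule for $\lambda_t$; the rule, the function names, the constant bound, and the randomly drawn importance ratios and features are all illustrative assumptions, not necessarily the scheme analyzed in the paper.

```python
import math
import random


def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def bounded_trace_step(e_prev, phi, gamma, rho_prev, bound):
    """One trace update e_t = lambda_t * gamma * rho_{t-1} * e_{t-1} + phi.

    lambda_t is chosen from the current trace (an assumed, illustrative rule):
    lambda_t = 1 if the scaled trace already has norm <= bound, otherwise
    lambda_t shrinks the scaled trace so that its norm equals bound.
    Assumes bound > 0.
    """
    scaled = [gamma * rho_prev * x for x in e_prev]
    h = norm2(scaled)
    lam = 1.0 if h <= bound else bound / h
    e = [lam * s + p for s, p in zip(scaled, phi)]
    return e, lam


def simulate(steps=1000, bound=5.0, gamma=0.95, seed=0):
    """Run the recursion on synthetic data and return the largest trace norm seen."""
    rng = random.Random(seed)
    e = [0.0, 0.0]
    max_norm = 0.0
    for _ in range(steps):
        # Hypothetical importance-sampling ratios pi/mu; large values would
        # make an unmodified trace (lambda = 1 throughout) grow without bound.
        rho = rng.choice([0.0, 2.0, 4.0])
        # Hypothetical feature vector with norm at most sqrt(2).
        phi = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        e, lam = bounded_trace_step(e, phi, gamma, rho, bound)
        max_norm = max(max_norm, norm2(e))
    return max_norm
```

Under this rule the trace norm never exceeds the bound plus the largest feature norm, while $\lambda_t$ stays at 1 whenever the trace is already small, which is the sense in which large $\lambda$ values remain compatible with bounded traces.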

Authors

Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

Topics

Machine Learning > Optimization & Theory > Stochastic Processes

Reinforcement Learning > Methods > Deep RL

Keywords

Markov decision process, Bellman equation, temporal-difference learning, off-policy learning, eligibility trace

Related papers

Simple Classification Using Binary Data (2018)

ELFI: Engine for Likelihood-Free Inference (2018)

Refining the Confidence Level for Optimistic Bandit Strategies (2018)

An Efficient and Effective Generic Agglomerative Hierarchical Clustering Approach (2018)

Convergence of Unregularized Online Learning Algorithms (2018)