An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

Richard S. Sutton; A. Rupam Mahmood; Martha White

2016 JMLR JMLR 2016

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

Abstract

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per- step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ$\lambda$). Compared to these methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state- dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states. [abs] [ pdf ][ bib ] © JMLR 2016. (edit, beta)

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

📈 Trend Setter — Offline RL

🧭 Keyword Pioneer — emphatic approach

🐣 Hot Topic Early Bird — reinforcement learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Richard S. Sutton , A. Rupam Mahmood , Martha White

Topics

Machine Learning > Optimization & Theory > Learning Theory Reinforcement Learning > Methods > Deep RL Reinforcement Learning > Methods > Offline RL Reinforcement Learning > Methods > Policy Learning Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning temporal difference learning function approximation value function temporal-difference learning off-policy learning emphatic approach

Download PDF

Related papers

Trend Filtering on Graphs 2016

Causal Inference through a Witness Protection Program 2016

A Characterization of Linkage-Based Hierarchical Clustering 2016

How to Center Deep Boltzmann Machines 2016

Minimax Rates in Permutation Estimation for Feature Matching 2016