UAI 2023

Modified Retrace for Off-Policy Temporal Difference Learning

Abstract

Off-policy learning is key to extending reinforcement learning, as it allows a target policy to be learned from data generated by a different behavior policy. However, combining it with bootstrapping and function approximation forms what is known as “the deadly triad”. Retrace is an efficient off-policy algorithm that employs truncated importance sampling ratios and is convergent with tabular value functions. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.
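
As a rough illustration of the truncated importance sampling ratios mentioned above, the sketch below computes the per-step trace coefficients of standard Retrace, commonly written as c_t = λ · min(1, π(a_t|s_t) / μ(a_t|s_t)). It does not implement the modified Retrace or TD-MRetrace proposed in the paper, whose details are not given in the abstract. The function name `retrace_coefficients` and its parameters are hypothetical, chosen only for this example.

```python
def retrace_coefficients(pi_probs, mu_probs, lam=1.0):
    """Truncated importance sampling ratios as used in standard Retrace.

    pi_probs: target-policy probabilities pi(a_t | s_t) of the actions taken
    mu_probs: behavior-policy probabilities mu(a_t | s_t) of the same actions
    lam:      trace-decay parameter lambda in [0, 1]

    All names here are illustrative assumptions, not taken from the paper.
    """
    coeffs = []
    for p, m in zip(pi_probs, mu_probs):
        if m <= 0.0:
            # The behavior policy must assign positive probability
            # to every action it actually took.
            raise ValueError("behavior probabilities must be positive")
        ratio = p / m                       # plain importance sampling ratio
        coeffs.append(lam * min(1.0, ratio))  # truncate at 1, then decay
    return coeffs


if __name__ == "__main__":
    # First action: ratio 2.0 is truncated to 1.0.
    # Second action: ratio 0.9 is below 1 and is kept as is.
    print(retrace_coefficients([0.5, 0.9], [0.25, 1.0]))
```

Truncating at 1 keeps the coefficients bounded, which limits the variance that unbounded importance sampling ratios can introduce when the target and behavior policies differ substantially.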
