Marginalized Operators for Off-policy Reinforcement Learning

Yunhao Tang; Mark Rowland; Rémi Munos; Michal Valko

AISTATS 2022

Abstract

In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, which are recovered as special cases. They also suggest a form of sample-based estimate with potential variance reduction relative to sample-based estimates of the original multi-step operators. We show that the estimates for marginalized operators can be computed in a scalable way, a result that also recovers prior results on marginalized importance sampling as special cases. Finally, we empirically demonstrate that marginalized operators provide performance gains in off-policy evaluation problems and in downstream policy optimization algorithms.
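For readers unfamiliar with the multi-step operators the abstract refers to, the following is a minimal sketch of a sample-based Retrace estimate computed from a single trajectory. It illustrates the baseline from prior work that marginalized operators generalize, not the marginalized operator itself. The function name `retrace_estimate` and all argument names are hypothetical, chosen for illustration, and the trajectory layout (per-step arrays of equal length) is an assumption.

```python
import numpy as np

def retrace_estimate(q, v_next, rewards, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Sample-based Retrace estimate of Q(x_0, a_0) from one trajectory.

    Hypothetical argument layout (all arrays have length T):
      q[t]        current estimate Q(x_t, a_t)
      v_next[t]   E_{a ~ pi} Q(x_{t+1}, a), with 0 after a terminal step
      rewards[t]  reward r_t
      pi_probs[t] target-policy probability pi(a_t | x_t)
      mu_probs[t] behavior-policy probability mu(a_t | x_t)
    """
    q = np.asarray(q, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)

    # Retrace trace coefficients: truncated step-wise importance ratios.
    c = lam * np.minimum(1.0, ratios)

    # Temporal-difference errors under the target policy.
    td = rewards + gamma * v_next - q

    # Product c_1 ... c_t for each t; the empty product at t = 0 equals 1.
    traces = np.concatenate(([1.0], np.cumprod(c[1:])))
    discounts = gamma ** np.arange(len(q))

    return q[0] + np.sum(discounts * traces * td)
```

The product of step-wise coefficients along the trajectory is the quantity whose variance can grow with the horizon, which is the motivation the abstract gives for considering marginalized alternatives.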

Topics

Machine Learning > Optimization & Theory > Stochastic Processes
Reinforcement Learning > Methods > Offline RL
Machine Learning > Learning Types > Reinforcement Learning
Machine Learning > Learning Types > Multi-Agent Systems

Keywords

off-policy evaluation, importance sampling, variance reduction, off-policy reinforcement learning, marginalized importance sampling, multi-step operator, marginalized operator

Related papers

Exploring Image Regions Not Well Encoded by an INN 2022

On Linear Model with Markov Signal Priors 2022

Probabilistic Numerical Method of Lines for Time-Dependent Partial Differential Equations 2022

On Distributionally Robust Optimization and Data Rebalancing 2022

Common Failure Modes of Subcluster-based Sampling in Dirichlet Process Gaussian Mixture Models - and a Deep-learning Solution 2022