Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu; Lihong Li; Ziyang Tang; Dengyong Zhou

2018 NIPS NeurIPS 2018

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Abstract

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of in infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators.Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Reinforcement Learning

📈 Trend Setter — Offline RL

🧭 Keyword Pioneer — off-policy estimation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Qiang Liu , Lihong Li , Ziyang Tang , Dengyong Zhou

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Machine Learning > Optimization & Theory > Stochastic Processes Reinforcement Learning > Methods > Offline RL Machine Learning > Learning Types > Reinforcement Learning Machine Learning > Learning Types > Stochastic Methods

Keywords

reinforcement learning importance sampling stationary distribution density ratio off-policy estimation infinite horizon

Download PDF

Related papers

Maximum Causal Tsallis Entropy Imitation Learning 2018

Recurrent World Models Facilitate Policy Evolution 2018

Bandit Learning in Concave N-Person Games 2018

Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation 2018

PAC-Bayes bounds for stable algorithms with instance-dependent priors 2018