Toward Minimax Off-policy Value Estimation

Lihong Li; Rémi Munos; Csaba Szepesvári

AISTATS 2015

Abstract

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. It is shown that in a large class of settings the so-called regression estimator is minimax optimal up to a constant that depends on the number of actions, while the other two can be arbitrarily worse even in the limit of infinitely many data points, despite their empirical success and popularity. The performance of these estimators is studied on synthetic and real problems, illustrating the nontriviality of this simple task. Finally, the results are extended to the problem of off-policy evaluation in contextual bandits and fixed-horizon Markov decision processes.
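To make the multi-armed bandit setting concrete, the sketch below gives a minimal illustration of a regression-style estimator: it averages the observed rewards of each action and weights those averages by the target policy. For contrast it also includes an importance-weighted estimator, which is a common choice in off-policy evaluation; the abstract does not name the other two estimators, so treating this as one of them is an assumption. The function names, the dictionary representation of policies, the toy data, and the convention that an unobserved action contributes zero are all illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def regression_estimate(actions, rewards, target_policy):
    """Plug-in estimate: per-action sample means weighted by the target policy."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, r in zip(actions, rewards):
        totals[a] += r
        counts[a] += 1
    value = 0.0
    for a, prob in target_policy.items():
        # Assumption: an action never observed in the data contributes 0.
        if counts[a] > 0:
            value += prob * totals[a] / counts[a]
    return value


def importance_weighted_estimate(actions, rewards, target_policy, behavior_policy):
    """Average of rewards reweighted by target/behavior probability ratios."""
    n = len(actions)
    weighted = (
        target_policy[a] / behavior_policy[a] * r
        for a, r in zip(actions, rewards)
    )
    return sum(weighted) / n


# Hypothetical toy data: two actions, logged under a known behavior policy.
actions = [0, 0, 1, 1]
rewards = [1.0, 0.0, 1.0, 1.0]
behavior = {0: 0.75, 1: 0.25}
target = {0: 0.5, 1: 0.5}

reg_value = regression_estimate(actions, rewards, target)
iw_value = importance_weighted_estimate(actions, rewards, target, behavior)
print(reg_value, iw_value)
```

On this toy sample the two estimates differ because the empirical action frequencies do not match the behavior policy's probabilities. The regression-style estimate depends only on per-action averages, whereas the importance-weighted estimate uses the known behavior probabilities directly.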

Topics

Machine Learning > Optimization & Theory > Statistical Learning
Machine Learning > Optimization & Theory > Theory
Reinforcement Learning > Methods > Offline RL

Keywords

off-policy evaluation, Markov decision process, minimax estimation, contextual bandit, regression estimator

Related papers

Near-optimal max-affine estimators for convex regression 2015

Sparse Solutions to Nonnegative Linear Systems and Applications 2015

Online Optimization: Competing with Dynamic Comparators 2015

Dimensionality estimation without distances 2015

The Security of Latent Dirichlet Allocation 2015