Off-Policy Confidence Sequences

Nikos Karampatziakis; Paul Mineiro; Aaditya Ramdas

2021 ICML ICML 2021

Off-Policy Confidence Sequences

Abstract

We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the “gated deployment” problem of safely upgrading a production contextual bandit system.

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

🧭 Keyword Pioneer — stopping time

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

Authors

Nikos Karampatziakis , Paul Mineiro , Aaditya Ramdas

Topics

Machine Learning > Optimization & Theory > Statistical Learning Machine Learning > Optimization & Theory > Stochastic Processes Reinforcement Learning > Methods > Deep RL

Keywords

off-policy evaluation contextual bandit confidence sequence stopping time

Download PDF

Related papers

GRAND: Graph Neural Diffusion 2021

Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits 2021

Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation 2021

Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution 2021

Dataset Dynamics via Gradient Flows in Probability Space 2021