Sublinear Optimal Policy Value Estimation in Contextual Bandits

Weihao Kong; Emma Brunskill; Gregory Valiant

AISTATS 2020

Abstract

We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting. We prove that in certain settings it is possible to obtain an accurate estimate of the optimal policy value even with a sublinear number of samples, in a regime where a linear number of samples would be needed to reliably estimate the reward obtainable by any policy. We establish nearly matching information-theoretic lower bounds, showing that our algorithm achieves near-optimal estimation error. Finally, we demonstrate the effectiveness of our algorithm on joke recommendation and cancer inhibition dosage selection problems using real datasets.
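To make the target quantity concrete: in a disjoint linear bandit, each arm has its own parameter vector, and the optimal policy value is the expected reward obtained by always playing the best arm for the observed context. The sketch below computes that value by Monte Carlo when the parameters are already known. It only illustrates what is being estimated and is not the paper's sublinear-sample estimator; the function name, the Gaussian contexts, and the problem sizes are all assumptions made for illustration.

```python
import random

def optimal_policy_value(betas, contexts):
    """Average, over contexts, of the best arm's expected reward.

    betas:    list of K parameter vectors (one per arm), each of length d.
    contexts: list of n context vectors, each of length d.
    Assumes a linear model: arm a's expected reward for context x is <beta_a, x>.
    """
    total = 0.0
    for x in contexts:
        # Expected reward of each arm for this context; keep the largest.
        total += max(sum(b_j * x_j for b_j, x_j in zip(beta, x)) for beta in betas)
    return total / len(contexts)

# Illustrative setup (hypothetical sizes and distributions, not from the paper).
random.seed(0)
K, d, n = 3, 5, 20000
betas = [[random.gauss(0, 1) / d ** 0.5 for _ in range(d)] for _ in range(K)]
contexts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
value = optimal_policy_value(betas, contexts)
```

Here the parameters are given, so the only error is Monte Carlo noise from sampling contexts. The abstract's setting differs: the parameters are unknown, and the claim is that the value can still be estimated accurately from a sublinear number of samples.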

Authors

Weihao Kong, Emma Brunskill, Gregory Valiant

Topics

Machine Learning > Core Methods > Regression
Mathematics & Optimization > Optimization > Stochastic Methods

Keywords

optimal policy, sublinear sample, contextual bandit, linear bandit, policy value

Related papers

Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons 2020

Fast and Accurate Ranking Regression 2020

Nonparametric Sequential Prediction While Deep Learning the Kernel 2020

Nested-Wasserstein Self-Imitation Learning for Sequence Generation 2020

Unconditional Coresets for Regularized Loss Minimization 2020