Matrix games with bandit feedback

Brendan O’Donoghue; Tor Lattimore; Ian Osband

2021 UAI UAI 2021

Matrix games with bandit feedback

Abstract

We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best-response to the learner’s mixed strategy. Along the way, we show that Thompson fails catastrophically in this setting and provide empirical comparison to existing algorithms.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Mathematics & Optimization

🐣 Hot Topic Early Bird — bandit feedback

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

Authors

Brendan O’Donoghue , Tor Lattimore , Ian Osband

Topics

Artificial Intelligence > Core AI > Game AI Mathematics & Optimization > Optimization > Stochastic Methods Mathematics & Optimization > Optimization > Online Algorithms Mathematics & Optimization > Optimization > Game Theory Machine Learning > Learning Types > Multi-Armed Bandits

Keywords

regret analysis game theory bandit feedback ucb algorithm multi-armed bandit regret bound zero-sum game matrix game

Download PDF

Related papers

Efficient greedy coordinate descent via variable partitioning 2021

Multi-output Gaussian Processes for uncertainty-aware recommender systems 2021

Constrained differentially private federated learning for low-bandwidth devices 2021

A weaker faithfulness assumption based on triple interactions 2021

Generalized parametric path problems 2021