Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

Sébastien Bubeck; Thomas Budzinski; Mark Sellke

2021 COLT COLT 2021

Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

Abstract

We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy heavily relied on a 2-dimensional geometric intuition that was difficult to generalize in higher dimensions, while here we take a more combinatorial route to build the new strategy.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Mathematics & Optimization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sébastien Bubeck , Thomas Budzinski , Mark Sellke

Topics

Artificial Intelligence > Core AI > Multi-Agent Systems Machine Learning > Optimization & Theory > Optimization Mathematics & Optimization > Optimization > Stochastic Methods

Keywords

stochastic process collision avoidance multi-armed bandit regret bound cooperative learning

Download PDF

Related papers

SGD Generalizes Better Than GD (And Regularization Doesn’t Help) 2021

Learning in Matrix Games can be Arbitrarily Complex 2021

Reconstructing weighted voting schemes from partial information about their power indices 2021

Online Learning from Optimal Actions 2021

Robust learning under clean-label attack 2021