Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Hassan SABER; Fabien Pesquerel; Odalric-ambrym Maillard; Mohammad Sadegh Talebi

ACML 2023

Abstract

We study regret minimization in an average-reward, communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains challenging, since these MDPs include as special cases large classes of problems such as combinatorial semi-bandits. How to leverage knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, which relies on a novel regret decomposition that exploits this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
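For readers unfamiliar with the bandit strategy the abstract builds on, the following is a minimal sketch of the classical IMED index for multi-armed bandits, not the paper's MDP algorithm. It assumes Bernoulli rewards, and the function names (`bernoulli_kl`, `imed_indexes`, `imed_choose`) are hypothetical, chosen for illustration. The stopping criterion and the use of recurrent classes described in the abstract are not shown.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Ber(p) || Ber(q)), with both means clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def imed_indexes(counts, means):
    """Bandit IMED index per arm: N_a * KL(mean_a, best_mean) + log(N_a).

    Assumes every arm has been pulled at least once (counts[a] >= 1).
    """
    best = max(means)
    return [n * bernoulli_kl(m, best) + math.log(n)
            for n, m in zip(counts, means)]


def imed_choose(counts, means):
    """Return the arm to pull next: the one with the smallest IMED index."""
    # Pull each arm once first, since log(0) is undefined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    indexes = imed_indexes(counts, means)
    return min(range(len(indexes)), key=indexes.__getitem__)
```

In this sketch, the empirically best arm has zero divergence, so its index is just the logarithm of its pull count. A suboptimal arm is therefore selected only while its divergence term remains small relative to that logarithm, which is the mechanism behind logarithmic regret in the bandit setting.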

Topics

Machine Learning > Optimization & Theory > Stochastic Processes
Reinforcement Learning > Methods > Policy Learning
Mathematics & Optimization > Optimization > Online Algorithms

Keywords

regret minimization; bandit algorithm; communicating Markov decision process; indexed minimum empirical divergence
