Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Omar Darwiche Domingues; Pierre Menard; Emilie Kaufmann; Michal Valko

ALT 2021

Abstract

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case, in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of $\Omega((H^3SA/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” that differs from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3SAT})$ regret lower bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
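To give a feel for how the two rates in the abstract scale, the following is a minimal sketch that evaluates them with all constants dropped. The function names are hypothetical and chosen only for illustration; the returned values are orders of magnitude, not the exact bounds proved in the paper.

```python
import math


def bpi_lower_bound_scaling(H, S, A, eps, delta):
    """Order of the sample-complexity lower bound from the abstract:
    (H^3 * S * A / eps^2) * log(1 / delta), with constants omitted (assumption)."""
    return (H**3 * S * A / eps**2) * math.log(1.0 / delta)


def regret_lower_bound_scaling(H, S, A, T):
    """Order of the regret lower bound from the abstract:
    sqrt(H^3 * S * A * T), with constants omitted (assumption)."""
    return math.sqrt(H**3 * S * A * T)


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    base = bpi_lower_bound_scaling(H=10, S=5, A=3, eps=0.1, delta=0.05)
    doubled = bpi_lower_bound_scaling(H=20, S=5, A=3, eps=0.1, delta=0.05)
    print("ratio when H doubles:", doubled / base)
    print("regret scaling:", regret_lower_bound_scaling(H=10, S=5, A=3, T=1000))
```

The cubic dependence on $H$ means that doubling the horizon multiplies the sample-complexity rate by eight, while halving $\varepsilon$ multiplies it by four.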

Topics

Machine Learning > Optimization & Theory > Learning Theory
Reinforcement Learning > Methods > Deep RL

Keywords

sample complexity, regret bound, minimax lower bound, non-stationary MDP, episodic MDP, PAC algorithm
