Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Jiafan He; Dongruo Zhou; Quanquan Gu

2021 NIPS NeurIPS 2021

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

🧭 Keyword Pioneer — bernstein-type bonus

🐣 Hot Topic Early Bird — minimax optimal

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jiafan He , Dongruo Zhou , Quanquan Gu

Topics

Machine Learning > Optimization & Theory > Learning Theory Machine Learning > Optimization & Theory > Theory Reinforcement Learning > Methods > Deep RL Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning minimax optimal regret bound optimism in face of uncertainty discounted mdp bernstein-type bonus

Download PDF

Related papers

Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data 2021

On Model Calibration for Long-Tailed Object Detection and Instance Segmentation 2021

Test-Time Personalization with a Transformer for Human Pose Estimation 2021

NTopo: Mesh-free Topology Optimization using Implicit Neural Representations 2021

Scalable Intervention Target Estimation in Linear Models 2021