Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Peter Auer; Ronald Ortner

NIPS 2006 (NeurIPS 2006)

Abstract

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
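
The upper-confidence-bound idea the abstract borrows from multi-armed bandits can be illustrated with a minimal sketch of the standard UCB1 index rule. This is not the paper's UCRL algorithm, only the bandit analogue it alludes to. The function name `ucb1`, the Bernoulli reward model, and all parameters are assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms (an assumed reward model).

    Returns the pull count per arm and the accumulated expected regret.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm was pulled
    sums = [0.0] * n_arms   # total reward observed per arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull every arm once to initialise the estimates
        else:
            # Optimism under uncertainty: empirical mean plus confidence width.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Expected regret of this step relative to always pulling the best arm.
        regret += best_mean - arm_means[arm]
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("expected regret:", regret)
```

In this sketch the confidence width shrinks as an arm is pulled more often, so a suboptimal arm is revisited only occasionally, which is what keeps regret growing slowly with the number of steps in the bandit setting.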

Topics

Machine Learning > Optimization & Theory > Learning Theory
Reinforcement Learning > Methods > Deep RL
Machine Learning > Learning Types > Online Learning
Machine Learning > Learning Types > Reinforcement Learning
Machine Learning > Optimization & Theory > Online Algorithms
Machine Learning > Learning Types > Multi-Armed Bandits
Reinforcement Learning > Methods > Value Iteration

Keywords

reinforcement learning, online learning, reinforcement learning theory, undiscounted setting, exploration-exploitation tradeoff, exploration-exploitation, optimal policy, upper confidence bound, regret bound, undiscounted reinforcement learning, online regret bound
