2006 NIPS NeurIPS 2006

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Abstract

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.

🚀 Conference Pioneer — NIPS 2006
🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning
📈 Trend Setter — Deep RL
🧭 Keyword Pioneer — reinforcement learning theory
🐣 Hot Topic Early Bird — reinforcement learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio
🌱 Topic Pioneer — Value Iteration