Online learning in episodic Markovian decision processes by relative entropy policy search

Alexander Zimin; Gergely Neu

NIPS 2013 (NeurIPS 2013)

Abstract

We study online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this problem is the regret, defined as the difference between the total loss suffered by the learner and the total loss of the best stationary policy. We assume that the learner has access to a finite action space $\mathcal{A}$ and that the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is at most $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting. These guarantees substantially improve on previously known results while requiring much milder assumptions, and they cannot be significantly improved under general assumptions.
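To get a feel for how the two guarantees scale, the following minimal sketch evaluates the bounds quoted in the abstract for given problem sizes. It only plugs numbers into the stated formulas and is not the algorithm itself; the function name `reps_regret_bounds` and its parameter names are hypothetical, and the logarithm is assumed to be the natural logarithm.

```python
import math


def reps_regret_bounds(num_layers, num_states, num_actions, num_episodes):
    """Evaluate the two regret bounds quoted in the abstract.

    Returns (bandit_bound, full_info_bound). Assumes a natural logarithm
    and |X||A| >= L so that the logarithmic term is non-negative.
    """
    if min(num_layers, num_states, num_actions, num_episodes) <= 0:
        raise ValueError("all arguments must be positive")
    ratio = num_states * num_actions / num_layers
    if ratio < 1:
        raise ValueError("expected |X||A| >= L so that the log term is non-negative")
    log_term = math.log(ratio)
    # Bandit setting: 2 * sqrt(L |X| |A| T log(|X||A| / L))
    bandit = 2 * math.sqrt(
        num_layers * num_states * num_actions * num_episodes * log_term
    )
    # Full information setting: 2 * L * sqrt(T log(|X||A| / L))
    full_info = 2 * num_layers * math.sqrt(num_episodes * log_term)
    return bandit, full_info


if __name__ == "__main__":
    bandit, full_info = reps_regret_bounds(
        num_layers=4, num_states=16, num_actions=4, num_episodes=10_000
    )
    print(f"bandit bound: {bandit:.1f}, full information bound: {full_info:.1f}")
```

Both bounds grow as $\sqrt{T}$, so quadrupling the number of episodes doubles each of them, and the bandit bound exceeds the full information bound by a factor of $\sqrt{|\mathcal{X}||\mathcal{A}|/L}$.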

Topics

Machine Learning > Optimization & Theory > Learning Theory
Reinforcement Learning > Methods > Policy Learning
Mathematics & Optimization > Optimization > Online Algorithms
Machine Learning > Learning Types > Online Learning
Machine Learning > Learning Types > Reinforcement Learning
Machine Learning > Optimization & Theory > Online Algorithms
Machine Learning > Learning Paradigms > Online Learning

Keywords

regret analysis, reinforcement learning, online learning, policy optimization, Markov decision processes, regret minimization, policy search, relative entropy policy search, bandit setting, regret bound
