Online Learning in Markov Decision Processes with Changing Cost Sequences

Travis Dick; András György; Csaba Szepesvári

2014 ICML ICML 2014

Online Learning in Markov Decision Processes with Changing Cost Sequences

Abstract

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.

🌉 Interdisciplinary Bridge — Machine Learning and Mathematics & Optimization and Reinforcement Learning

📈 Trend Setter — Offline RL

🐣 Hot Topic Early Bird — markov decision process

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Travis Dick , András György , Csaba Szepesvári

Topics

Machine Learning > Optimization & Theory > Stochastic Processes Reinforcement Learning > Methods > Offline RL Mathematics & Optimization > Optimization > Online Algorithms

Keywords

online learning markov decision process bandit feedback mirror descent stochastic shortest path regret bound

Download PDF

Related papers

Demystifying Information-Theoretic Clustering 2014

Margins, Kernels and Non-linear Smoothed Perceptrons 2014

Large-Margin Metric Learning for Constrained Partitioning Problems 2014

Efficient Approximation of Cross-Validation for Kernel Methods using Bouligand Influence Function 2014

Generalized Exponential Concentration Inequality for Renyi Divergence Estimation 2014