On Q-learning Convergence for Non-Markov Decision Processes

Sultan Javed Majeed; Marcus Hutter

2018 IJCAI IJCAI 2018

On Q-learning Convergence for Non-Markov Decision Processes

Abstract

Temporal-difference (TD) learning is an attractive, computationally efficient framework for model- free reinforcement learning. Q-learning is one of the most widely used TD learning technique that enables an agent to learn the optimal action-value function, i.e. Q-value function. Contrary to its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains which may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.

🧭 Keyword Pioneer — non-markov decision process

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

🐣 Hot Topic Early Bird — temporal difference learning

Authors

Sultan Javed Majeed , Marcus Hutter

Topics

Machine Learning > Optimization & Theory > Learning Theory Machine Learning > Optimization & Theory > Optimization Machine Learning > Optimization & Theory > Stochastic Processes Machine Learning > Optimization & Theory > Theory Reinforcement Learning > Methods > Deep RL Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning temporal difference learning convergence analysis markov decision process optimal control temporal-difference learning stochastic process convergence guarantee non-markov decision process

Download PDF

Related papers

Semi-Supervised Multi-Modal Learning with Incomplete Modalities 2018

High-dimensional Similarity Learning via Dual-sparse Random Projection 2018

FISH-MML: Fisher-HSIC Multi-View Metric Learning 2018

Generative Warfare Nets: Ensemble via Adversaries and Collaborators 2018

Semi-Supervised Optimal Margin Distribution Machines 2018