Towards painless policy optimization for constrained MDPs

Arushi Jain; Sharan Vaswani; Reza Babanezhad; Csaba Szepesvári; Doina Precup

UAI 2022

Abstract

We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework with coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are $\epsilon_b$-close to the span of the $d$-dimensional state-action features and that there are no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3 \sqrt{T}} + \frac{\epsilon_b \sqrt{d}}{(1 - \gamma)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} + \frac{\epsilon_b \sqrt{d}}{1 - \gamma} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
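The abstract credits coin-betting algorithms for removing the need for step-size tuning. As a rough illustration of that idea only, the sketch below shows a generic Krichevsky-Trofimov (KT) coin-betting learner for one-dimensional online linear optimization. It is not the CBP algorithm from the paper: the class name `KTBettor`, the `initial_wealth` parameter, and the assumption that every gradient lies in $[-1, 1]$ are choices made here for illustration.

```python
class KTBettor:
    """Generic KT coin-betting learner for 1-D online linear optimization.

    Illustrative sketch only (hypothetical class, not the paper's CBP).
    Assumes every gradient passed to `update` lies in [-1, 1].
    """

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth  # current wealth of the bettor
        self.coin_sum = 0.0           # running sum of observed coin outcomes
        self.t = 0                    # number of completed rounds

    def predict(self):
        # Betting fraction: average of past coins, with t + 1 in the denominator.
        # There is no learning rate anywhere in this rule.
        beta = self.coin_sum / (self.t + 1)
        return beta * self.wealth

    def update(self, gradient):
        if not -1.0 <= gradient <= 1.0:
            raise ValueError("gradient must lie in [-1, 1]")
        coin = -gradient              # coin outcome is the negative gradient
        bet = self.predict()          # the iterate played in this round
        self.wealth += coin * bet     # win or lose in proportion to the bet
        self.coin_sum += coin
        self.t += 1


if __name__ == "__main__":
    # Toy usage: minimize |w - 0.5| using subgradients sign(w - 0.5).
    learner = KTBettor()
    for _ in range(200):
        w = learner.predict()
        learner.update(1.0 if w > 0.5 else -1.0)
    print(learner.predict())
```

The iterate is the product of a betting fraction and the accumulated wealth, so its scale adapts to the observed gradients without a tuned step size, which is the property the abstract contrasts with gradient descent-ascent.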

Topics

Machine Learning > Optimization & Theory > Learning Theory; Machine Learning > Optimization & Theory > Optimization; Reinforcement Learning > Methods > Policy Learning

Keywords

policy optimization; linear function approximation; constrained Markov decision process; primal-dual method; coin-betting algorithm
