Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL

Arnob Ghosh; Xingyu Zhou; Ness Shroff

2024 AISTATS AISTATS 2024

Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL

Abstract

We study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where $K$ is the number of episodes, $d$ is the dimension of the feature mapping, $H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.

🧭 Keyword Pioneer — hard constraint violation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

Authors

Arnob Ghosh , Xingyu Zhou , Ness Shroff

Topics

Machine Learning > Optimization & Theory > Optimization Reinforcement Learning > Methods > Deep RL Reinforcement Learning > Methods > Policy Learning Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning policy optimization primal-dual algorithm constrained markov decision process regret bound model-free reinforcement learning hard constraint violation model-free rl

Download PDF

Related papers

Causal Bandits with General Causal Models and Interventions 2024

Boundary-Aware Uncertainty for Feature Attribution Explainers 2024

Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective 2024

A Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement Learning 2024

Pure Exploration in Bandits with Linear Constraints 2024