Data-Efficient Policy Evaluation Through Behavior Policy Search

Josiah P. Hanna; Yash Chandak; Philip S. Thomas; Martha White; Peter Stone; Scott Niekum

JMLR, 2024

Abstract

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy, i.e., a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.
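As a rough illustration of the claim that off-policy data can yield lower-error unbiased estimates, the following sketch compares on-policy Monte Carlo evaluation with ordinary importance sampling on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the single-state, two-action problem, the deterministic positive returns in `RETURNS`, the evaluation policy `PI_E`, and all function names. The behavior policy `PI_B` is set proportional to the evaluation policy's action probability times the return, which is the standard variance-minimizing choice for importance sampling in this simple setting; it is not one of the paper's two search algorithms, which address the practical case where returns are unknown in advance.

```python
import random

# Hypothetical toy problem: a single-state MDP with two actions and
# deterministic, positive returns. None of these numbers come from the paper.
RETURNS = [1.0, 10.0]   # return obtained by action 0 and by action 1
PI_E = [0.9, 0.1]       # evaluation policy whose expected return we want


def true_value():
    """Exact expected return of PI_E, used only to measure estimator error."""
    return sum(p * g for p, g in zip(PI_E, RETURNS))


def on_policy_estimate(n, rng):
    """Standard technique: sample actions from PI_E and average the returns."""
    total = 0.0
    for _ in range(n):
        a = rng.choices([0, 1], weights=PI_E)[0]
        total += RETURNS[a]
    return total / n


def importance_sampling_estimate(n, pi_b, rng):
    """Sample actions from pi_b and reweight each return by PI_E / pi_b."""
    total = 0.0
    for _ in range(n):
        a = rng.choices([0, 1], weights=pi_b)[0]
        total += (PI_E[a] / pi_b[a]) * RETURNS[a]
    return total / n


def mse(estimator, trials, rng):
    """Empirical mean squared error of an estimator over repeated trials."""
    v = true_value()
    return sum((estimator(rng) - v) ** 2 for _ in range(trials)) / trials


# Behavior policy proportional to PI_E(a) * return(a). With deterministic,
# positive returns every reweighted sample equals the true value.
_z = sum(p * g for p, g in zip(PI_E, RETURNS))
PI_B = [p * g / _z for p, g in zip(PI_E, RETURNS)]

N, TRIALS = 10, 200
rng = random.Random(0)
mse_on = mse(lambda r: on_policy_estimate(N, r), TRIALS, rng)
mse_is = mse(lambda r: importance_sampling_estimate(N, PI_B, r), TRIALS, rng)
print("on-policy MSE:", mse_on)
print("importance-sampling MSE with PI_B:", mse_is)
```

In this toy case the reweighted return is the same for both actions, so the importance-sampling estimate has essentially zero error, while the on-policy estimate does not. Constructing `PI_B` here required knowing the returns, which is exactly the information that is unavailable in practice and the reason a search over behavior policies is needed.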

Authors

Josiah P. Hanna, Yash Chandak, Philip S. Thomas, Martha White, Peter Stone, Scott Niekum

Topics

Reinforcement Learning > Methods > Offline RL
Reinforcement Learning > Methods > Policy Learning

Keywords

off-policy evaluation, policy search, mean squared error, behavior policy
