OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

Jongmin Lee; Wonseok Jeon; Byungjun Lee; Joelle Pineau; Kee-eung Kim

ICML 2021


Abstract

We consider the offline reinforcement learning (RL) setting, where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, distributional shift becomes the primary source of difficulty; it arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with state-of-the-art methods.
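To make the term "stationary distribution correction" concrete, the sketch below computes it for a toy problem. It is an illustration only, not the OptiDICE algorithm: it assumes a hypothetical two-state, two-action MDP whose dynamics are fully known, and it evaluates a fixed target policy rather than the optimal one. All names and numbers (`P`, `r`, `p0`, `behavior`, `target`, `occupancy`) are made up for the example. The correction is taken here to be the ratio w(s, a) = d_target(s, a) / d_behavior(s, a) between the normalized discounted state-action occupancies of the two policies, the usual meaning of the term in this line of work.

```python
import math

def occupancy(P, pi, p0, gamma, iters=1000):
    """Normalized discounted state-action occupancy d(s, a) of policy pi.

    Iterates the fixed point d_s = (1 - gamma) * p0 + gamma * P_pi^T d_s,
    then sets d(s, a) = d_s(s) * pi(a | s).
    """
    n_s, n_a = len(pi), len(pi[0])
    d_s = list(p0)
    for _ in range(iters):
        new = [(1.0 - gamma) * p0[t] for t in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                for t in range(n_s):
                    new[t] += gamma * d_s[s] * pi[s][a] * P[s][a][t]
        d_s = new
    return [[d_s[s] * pi[s][a] for a in range(n_a)] for s in range(n_s)]

# Hypothetical toy MDP (assumption, not from the paper): P[s][a][s'] and r[s][a].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
r = [[0.0, 1.0], [0.5, 2.0]]
p0 = [1.0, 0.0]
gamma = 0.9

behavior = [[0.5, 0.5], [0.5, 0.5]]  # policy that generated the data
target = [[0.1, 0.9], [0.2, 0.8]]    # policy we want to evaluate

d_behavior = occupancy(P, behavior, p0, gamma)
d_target = occupancy(P, target, p0, gamma)

# Stationary distribution correction: ratio of the two occupancies.
w = [[d_target[s][a] / d_behavior[s][a] for a in range(2)] for s in range(2)]

# Normalized return of the target policy, computed two ways:
# directly from its own occupancy, and by reweighting the behavior occupancy.
value_direct = sum(d_target[s][a] * r[s][a] for s in range(2) for a in range(2))
value_reweighted = sum(
    d_behavior[s][a] * w[s][a] * r[s][a] for s in range(2) for a in range(2)
)
```

The two values agree because multiplying the behavior occupancy by `w` recovers the target occupancy. In the offline setting the dynamics are unknown and only samples from the behavior occupancy are available, so the ratio has to be estimated from data; the abstract states that OptiDICE estimates this correction directly for the optimal policy.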



Topics

Machine Learning > Learning Types > Unsupervised Learning
Reinforcement Learning > Methods > Offline RL
Machine Learning > Learning Types > Reinforcement Learning

Keywords

offline reinforcement learning, policy optimization, Markov decision process, action value, stationary distribution, distributional shift
