Batch Reinforcement Learning with Hyperparameter Gradients

Byungjun Lee; Jongmin Lee; Peter Vrancx; Dongho Kim; Kee-eung Kim

ICML 2020

Abstract

We consider the batch reinforcement learning problem, where the agent must learn from a fixed batch of data alone, without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy, since otherwise the estimation becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly will result in a policy that merely follows the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that optimizes the hyperparameter by gradient-based optimization on held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks by striking a good balance between adhering to the data collection policy and pursuing possible policy improvement.
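The idea of tuning the trade-off hyperparameter by gradient on held-out data can be illustrated with a toy sketch. This is not the BOPAH algorithm itself: it assumes a single-state (bandit) problem, a KL penalty toward the data collection policy as the form of regularization, and per-action reward estimates already computed from a training split and a held-out split. All function and variable names here are hypothetical.

```python
import numpy as np

def regularized_policy(behavior, train_reward, alpha):
    """Closed-form maximiser of E_pi[r_train] - alpha * KL(pi || behavior)."""
    logits = np.log(behavior) + train_reward / alpha
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def heldout_value_and_grad(behavior, train_reward, heldout_reward, alpha):
    """Held-out value of the regularized policy and its derivative in alpha."""
    pi = regularized_policy(behavior, train_reward, alpha)
    value = pi @ heldout_reward
    # d pi(a) / d alpha = -pi(a) * (r_train(a) - E_pi[r_train]) / alpha**2
    centered = train_reward - pi @ train_reward
    grad = -((pi * centered) @ heldout_reward) / alpha**2
    return value, grad

def tune_alpha(behavior, train_reward, heldout_reward, alpha=1.0, lr=0.5, steps=200):
    """Gradient ascent on log(alpha), which keeps alpha positive."""
    log_alpha = np.log(alpha)
    for _ in range(steps):
        a = np.exp(log_alpha)
        _, grad = heldout_value_and_grad(behavior, train_reward, heldout_reward, a)
        log_alpha += lr * a * grad  # chain rule: dJ/dlog(alpha) = alpha * dJ/dalpha
    return np.exp(log_alpha)

# Illustrative numbers only (assumed, not taken from the paper).
behavior = np.array([0.5, 0.3, 0.2])        # data collection policy
train_reward = np.array([1.0, 0.2, 0.5])    # estimates from the training split
heldout_reward = np.array([0.8, 0.4, 0.1])  # estimates from the held-out split
alpha_star = tune_alpha(behavior, train_reward, heldout_reward)
```

A large `alpha` keeps the policy close to the data collection policy, while a small `alpha` lets it chase the training estimates. In this sketch the held-out split decides how far to go, which is the role the abstract assigns to the hyperparameter gradient.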

Topics

Machine Learning > Optimization & Theory > Optimization
Reinforcement Learning > Methods > Offline RL
Machine Learning > Learning Types > Reinforcement Learning
Artificial Intelligence > Core AI > Robotics

Keywords

offline reinforcement learning, policy optimization, hyperparameter optimization, batch reinforcement learning, off-policy learning, held-out data
