Unified Policy Optimization for Robust Reinforcement Learning

Zichuan Lin; Li Zhao; Jiang Bian; Tao Qin; Guangwen Yang

ACML 2019

Abstract

Recent years have witnessed significant progress in solving challenging problems across various domains using deep reinforcement learning (RL). Despite this success, weak robustness has emerged as a major obstacle to applying existing RL algorithms to real-world problems. In this paper, we propose unified policy optimization (UPO), a sample-efficient shared-policy framework that allows a policy to update itself by considering the different gradients generated by different policy gradient (PG) methods. Specifically, we propose two algorithms, UPO-MAB and UPO-ES, which leverage these gradients by adopting ideas from multi-arm bandit (MAB) and evolution strategies (ES), respectively, with the aim of finding the gradient direction that yields more performance gain at less extra data cost. Extensive experiments show that our approach achieves stronger robustness and better performance than the baselines.
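To make the bandit idea in the abstract concrete, the following is a minimal toy sketch, not the authors' UPO-MAB. It assumes a one-dimensional parameter, a hypothetical quadratic objective standing in for expected return, and three made-up gradient sources standing in for different PG methods. A standard UCB1 bandit picks which source to use at each step and is rewarded with the resulting change in the objective. All function names and constants here are illustrative assumptions.

```python
import math
import random


def ucb1_select(counts, values, t, c=2.0):
    """Pick an arm by UCB1: try each arm once, then maximize mean + bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def objective(theta):
    # Toy stand-in for expected return (an assumption); maximized at theta = 3.
    return -(theta - 3.0) ** 2


def run_bandit_over_gradients(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Hypothetical gradient sources standing in for different PG methods.
    estimators = [
        lambda th: -2.0 * (th - 3.0),                      # exact gradient
        lambda th: -2.0 * (th - 3.0) + rng.gauss(0, 5.0),  # noisy gradient
        lambda th: 2.0 * (th - 3.0),                       # misleading gradient
    ]
    counts = [0] * len(estimators)
    values = [0.0] * len(estimators)  # running mean reward per arm
    theta = 0.0
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        candidate = theta + lr * estimators[arm](theta)
        gain = objective(candidate) - objective(theta)  # bandit reward
        counts[arm] += 1
        values[arm] += (gain - values[arm]) / counts[arm]
        if gain > 0:  # keep only updates that improve the objective
            theta = candidate
    return theta, counts


if __name__ == "__main__":
    final_theta, pulls = run_bandit_over_gradients()
    print("final theta:", final_theta, "pulls per source:", pulls)
```

In this sketch the objective is evaluated exactly, so measuring the gain is free. In an RL setting that evaluation would itself require environment samples, which is presumably why the abstract stresses finding good directions "with less extra data cost".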

Authors

Zichuan Lin, Li Zhao, Jiang Bian, Tao Qin, Guangwen Yang

Topics

Reinforcement Learning > Methods > Deep RL
Reinforcement Learning > Methods > Policy Learning

Keywords

policy optimization, policy gradient, robust reinforcement learning, multi-arm bandit, evolution strategy
