The Sufficiency of Off-Policyness and Soft Clipping: PPO Is Still Insufficient according to an Off-Policy Measure

Xing Chen; Dongcui Diao; Hechang Chen; Hengshuai Yao; haiyin piao; Zhixiao Sun; Zhiwei Yang; Randy Goebel; Bei Jiang; Yi Chang

2023 AAAI AAAI 2023

The Sufficiency of Off-Policyness and Soft Clipping: PPO Is Still Insufficient according to an Off-Policy Measure

Abstract

Abstract The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Does there exist better policies outside of this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploration), we found that the answer is "YES", and the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in "off-policyness", according to an off-policy metric called DEON. Our algorithm explores in a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all current PPO methods have the clipping operation and optimize in the clipped policy space. Our method is the first of this kind, which advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Reinforcement Learning

🧭 Keyword Pioneer — soft clipping

🐣 Hot Topic Early Bird — proximal policy optimization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xing Chen , Dongcui Diao , Hechang Chen , Hengshuai Yao , haiyin piao , Zhixiao Sun , Zhiwei Yang , Randy Goebel , Bei Jiang , Yi Chang

Topics

Reinforcement Learning > Methods > Deep RL Reinforcement Learning > Methods > Policy Learning Artificial Intelligence > Core AI > Reinforcement Learning

Keywords

policy gradient off-policy learning proximal policy optimization conservative policy iteration soft clipping

Download PDF

Related papers

A Model-Agnostic Heuristics for Selective Classification 2023

Tackling Safe and Efficient Multi-Agent Reinforcement Learning via Dynamic Shielding (Student Abstract) 2023

Head-Free Lightweight Semantic Segmentation with Linear Transformer 2023

Hierarchical ConViT with Attention-Based Relational Reasoner for Visual Analogical Reasoning 2023

Deep Spiking Neural Networks with High Representation Similarity Model Visual Pathways of Macaque and Mouse 2023