Maximum Entropy Softmax Policy Gradient via Entropy Advantage Estimation

Jean Seong Bjorn Choe; Jong-kook Kim

2025 IJCAI IJCAI 2025

Maximum Entropy Softmax Policy Gradient via Entropy Advantage Estimation

Abstract

Entropy Regularisation is a widely adopted technique that enhances policy optimisation performance and stability. Maximum entropy reinforcement learning (MaxEnt RL) regularises policy evaluation by augmenting the objective with an entropy term, showing theoretical benefits in policy optimisation. However, its practical application in straightforward direct policy gradient settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes Entropy Advantage Policy Optimisation (EAPO), a simple method that facilitates MaxEnt RL implementation by separately estimating task and entropy objectives. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework improves optimisation performance, generalisation, and exploration in various environments. Moreover, our method provides a stable and performant MaxEnt RL algorithm for discrete action spaces.

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning

🧭 Keyword Pioneer — entropy advantage

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jean Seong Bjorn Choe , Jong-kook Kim

Topics

Machine Learning > Optimization & Theory > Loss Functions Machine Learning > Optimization & Theory > Optimization Reinforcement Learning > Methods > Policy Learning

Keywords

maximum entropy policy gradient entropy regularization proximal policy optimization trust region policy optimization entropy advantage

Download PDF

Related papers

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain 2025

Responsibility Anticipation and Attribution in LTLf 2025

Argument-based Multi-Issue Negotiation 2025

Online Resource Sharing: Better Robust Guarantees via Randomized Strategies 2025

Equitable Mechanism Design for Facility Location 2025