ICML 2025

Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback