Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

Sanghwan Bae; Jiwoo Hong; Min Young Lee; Hanbyul Kim; JeongYeon Nam; Donghyun Kwak

EACL 2026

Abstract

Recent advances in reinforcement learning with verifiable rewards (RLVR) show that large language models enhance their reasoning abilities when trained with verifiable signals. However, due to reward sparsity, effectiveness depends heavily on selecting samples of appropriate difficulty. In this work, we present a formal analysis of online difficulty-aware filtering and establish its theoretical foundations. We show that expected policy improvement is lower-bounded by the variance of task-level success probabilities, implying that selecting tasks of intermediate difficulty maximizes learning efficiency. Building on this, we demonstrate that balanced filtering maximizes this lower bound, leading to superior performance and sample efficiency. Evaluations across multiple math reasoning benchmarks validate that balanced filtering consistently enhances convergence speed and final performance, achieving up to +12% gains in less than half the training steps of standard GRPO. By extending our analysis to various reward distributions, we provide a principled foundation for future RLVR curriculum strategies, confirmed through both theoretical analysis and extensive empirical results.
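To make the intuition concrete, the sketch below illustrates one plausible reading of balanced filtering for binary verifiable rewards. It is not the authors' implementation. For a prompt with success probability p, the reward variance is p(1 - p), which peaks at p = 0.5 and vanishes at p = 0 and p = 1, so prompts that are always solved or never solved carry no group-relative learning signal. The function names, the sample batch, and the thresholds `low=0.25` and `high=0.75` are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def success_rate(rewards: Sequence[int]) -> float:
    """Empirical success probability from binary (0/1) rollout rewards."""
    return sum(rewards) / len(rewards)


def bernoulli_variance(p: float) -> float:
    """Variance of a 0/1 reward with success probability p: p * (1 - p)."""
    return p * (1.0 - p)


def balanced_filter(
    prompt_rewards: Sequence[Tuple[str, Sequence[int]]],
    low: float = 0.25,   # hypothetical lower threshold, not from the paper
    high: float = 0.75,  # hypothetical upper threshold, symmetric around 0.5
) -> List[Tuple[str, float]]:
    """Keep prompts whose empirical success rate lies in [low, high]."""
    kept = []
    for prompt, rewards in prompt_rewards:
        p = success_rate(rewards)
        if low <= p <= high:
            kept.append((prompt, p))
    return kept


# Illustrative batch: each prompt has 8 sampled rollouts with 0/1 rewards.
batch = [
    ("easy", [1, 1, 1, 1, 1, 1, 1, 1]),    # always solved: zero variance
    ("medium", [1, 0, 1, 0, 0, 1, 1, 0]),  # solved half the time
    ("hard", [0, 0, 0, 0, 0, 0, 0, 0]),    # never solved: zero variance
]

kept = balanced_filter(batch)
print(kept)  # [('medium', 0.5)]
```

In an online setting, a filter of this kind would be re-applied at every training step using rollouts from the current policy, so the set of retained prompts shifts as the model improves.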

Topics

Artificial Intelligence > Core AI > Agent Systems; Machine Learning > Learning Types > Self-Supervised Learning; Machine Learning > Optimization & Theory > Optimization

Keywords

reinforcement learning; sample efficiency; mathematical reasoning; policy improvement; verifiable reward; difficulty filtering
