Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
Abstract
Low sampling efficiency during the rollout phase poses a significant challenge to scaling reinforcement learning (RL) for large language model (LLM) reasoning. Existing methods attempt to improve efficiency by scheduling problems according to their difficulty. However, these approaches rely on unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To address these limitations, we introduce Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulty by aggregating historical performance discrepancies across problems. Model competence is then quantified to adaptively select problems whose difficulty aligns with the model's current competence, using a fixed-point system. Extensive experiments on mathematical RL training show that CDAS consistently outperforms strong baselines, achieving the highest average accuracy of 45.89%. Furthermore, CDAS reduces the training step time overhead by 57.06% compared with the widely used Dynamic Sampling strategy, confirming its efficiency. Additional experiments across different tasks, model architectures, and model sizes demonstrate that CDAS generalizes well.
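To make the sampling idea concrete, the following is a minimal illustrative sketch of one way competence-difficulty alignment could be implemented. It is not the exact formulation used by CDAS. The class name `AlignmentSampler`, the sigmoid link between the competence-difficulty gap and the expected pass rate, the running-mean aggregation, and the definition of competence as the negative mean difficulty are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used here (as an assumption) to map a gap to a pass rate."""
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Hypothetical sampler: picks problems whose difficulty is closest to competence."""

    def __init__(self, num_problems):
        self.gap_sum = [0.0] * num_problems     # accumulated (expected - observed) pass rates
        self.visits = [0] * num_problems        # number of times each problem was sampled
        self.difficulty = [0.0] * num_problems  # running mean of the accumulated gaps
        self.competence = 0.0                   # scalar summary of the model's current ability

    def select(self, batch_size):
        # Rank problems by |competence - difficulty|; ties are broken by index.
        ranked = sorted(
            range(len(self.difficulty)),
            key=lambda i: (abs(self.competence - self.difficulty[i]), i),
        )
        return ranked[:batch_size]

    def update(self, pass_rates):
        # pass_rates maps a problem index to its observed pass rate in [0, 1].
        for i, observed in pass_rates.items():
            expected = sigmoid(self.competence - self.difficulty[i])
            # A problem looks harder when the model does worse than expected.
            self.gap_sum[i] += expected - observed
            self.visits[i] += 1
            self.difficulty[i] = self.gap_sum[i] / self.visits[i]
        # Assumed competence definition: negative mean difficulty over all problems.
        self.competence = -sum(self.difficulty) / len(self.difficulty)


sampler = AlignmentSampler(num_problems=3)
sampler.update({0: 1.0, 1: 0.0, 2: 0.5})  # solved always, never, half the time
print(sampler.difficulty)
print(sampler.select(batch_size=1))
```

In this sketch, averaging the discrepancies over a problem's whole history, rather than using only the most recent pass rate, is what is meant to stabilize the difficulty estimate. Alternating `select` and `update` across training steps moves difficulty and competence toward values at which the expected and observed pass rates agree.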