Research Explorer
Deep RL
3861 directly classified papers
Papers per year
2005: 1
2006: 9
2007: 14
2008: 15
2009: 9
2010: 21
2011: 27
2012: 32
2013: 21
2014: 17
2015: 10
2016: 33
2017: 102
2018: 222
2019: 399
2020: 450
2021: 533
2022: 478
2023: 532
2024: 513
2025: 326
2026: 97
Papers
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL
ACL 2025
Steering LLM Reasoning Through Bias-Only Adaptation
EMNLP 2025
Identification of Multiple Logical Interpretations in Counter-Arguments
EMNLP 2025
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
ACL 2025
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
ACL 2025
Dialogue Systems for Emotional Support via Value Reinforcement
ACL 2025
ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
ACL 2025
S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
ACL 2025
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
ACL 2025
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
ACL 2025
Optimizing RLHF Training for Large Language Models with Stage Fusion
NSDI 2025
Fixing Distribution Shifts of LLM Self-Critique via On-Policy Self-Play Training
ACL 2025
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
ACL 2025
EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
ACL 2025
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning
ACL 2025
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
ACL 2025
Lost in the Context: Insufficient and Distracted Attention to Contexts in Preference Modeling
ACL 2025
FloorPlan-LLaMa: Aligning Architects’ Feedback and Domain Knowledge in Architectural Floor Plan Generation
ACL 2025
Optimizing Decomposition for Optimal Claim Verification
ACL 2025
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
ACL 2025
GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
ACL 2025
Improve Vision Language Model Chain-of-thought Reasoning
ACL 2025
ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model
WACV 2025
An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals
ACL 2025
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
EMNLP 2025