Deep RL

3861 directly classified papers

Papers

One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL ACL 2025

Steering LLM Reasoning Through Bias-Only Adaptation EMNLP 2025

Identification of Multiple Logical Interpretations in Counter-Arguments EMNLP 2025

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization ACL 2025

AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments ACL 2025

Dialogue Systems for Emotional Support via Value Reinforcement ACL 2025

ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework ACL 2025

S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning ACL 2025

Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models ACL 2025

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse ACL 2025

Optimizing RLHF Training for Large Language Models with Stage Fusion NSDI 2025

Fixing Distribution Shifts of LLM Self-Critique via On-Policy Self-Play Training ACL 2025

TreeRL: LLM Reinforcement Learning with On-Policy Tree Search ACL 2025

EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning ACL 2025

SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning ACL 2025

ACECODER: Acing Coder RL via Automated Test-Case Synthesis ACL 2025

Lost in the Context: Insufficient and Distracted Attention to Contexts in Preference Modeling ACL 2025

FloorPlan-LLaMa: Aligning Architects’ Feedback and Domain Knowledge in Architectural Floor Plan Generation ACL 2025

Optimizing Decomposition for Optimal Claim Verification ACL 2025

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis ACL 2025

GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization ACL 2025

Improve Vision Language Model Chain-of-thought Reasoning ACL 2025

ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model WACV 2025

An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals ACL 2025

ActionStudio: A Lightweight Framework for Data and Training of Large Action Models EMNLP 2025