Research Explorer
Reinforcement Learning
2932 directly classified papers
Papers per year
2003: 1
2006: 11
2007: 18
2008: 23
2009: 14
2010: 22
2011: 24
2012: 34
2013: 26
2014: 24
2015: 14
2016: 23
2017: 79
2018: 182
2019: 255
2020: 284
2021: 333
2022: 319
2023: 315
2024: 457
2025: 419
2026: 55
Papers
Debate4MATH: Multi-Agent Debate for Fine-Grained Reasoning in Math
ACL 2025
DEBATE, TRAIN, EVOLVE: Self‐Evolution of Language Model Reasoning
EMNLP 2025
Natural Logic at the Core: Dynamic Rewards for Entailment Tree Generation
ACL 2025
LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
EMNLP 2025
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
EMNLP 2025
OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework
EMNLP 2025
MWPO: Enhancing LLMs Performance through Multi-Weight Preference Strength and Length Optimization
ACL 2025
Detoxifying Large Language Models via the Diversity of Toxic Samples
EMNLP 2025
DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization
EMNLP 2025
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
EMNLP 2025
HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation
EMNLP 2025
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
CVPR 2025
bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning
ACL 2025
Personalized Preference Fine-tuning of Diffusion Models
CVPR 2025
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
ACL 2025
Adversarial Preference Learning for Robust LLM Alignment
ACL 2025
Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction
IJCNLP 2025
Continuously evolving rewards in an open-ended environment
JMLR 2025
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
IJCNLP 2025
A Reinforcement Learning Framework for Cross-Lingual Stance Detection Using Chain-of-Thought Alignment
ACL 2025
Structured Document Translation via Format Reinforcement Learning
IJCNLP 2025
Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability
JMLR 2025
Understanding Reference Policies in Direct Preference Optimization
NAACL 2025
Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
ACL 2025
A Practical Analysis of Human Alignment with *PO
NAACL 2025