Reinforcement Learning

2932 directly classified papers

Papers

Debate4MATH: Multi-Agent Debate for Fine-Grained Reasoning in Math ACL 2025

DEBATE, TRAIN, EVOLVE: Self‐Evolution of Language Model Reasoning EMNLP 2025

Natural Logic at the Core: Dynamic Rewards for Entailment Tree Generation ACL 2025

LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization EMNLP 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP 2025

OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework EMNLP 2025

MWPO: Enhancing LLMs Performance through Multi-Weight Preference Strength and Length Optimization ACL 2025

Detoxifying Large Language Models via the Diversity of Toxic Samples EMNLP 2025

DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization EMNLP 2025

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning EMNLP 2025

HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation EMNLP 2025

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation CVPR 2025

bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning ACL 2025

Personalized Preference Fine-tuning of Diffusion Models CVPR 2025

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification ACL 2025

Adversarial Preference Learning for Robust LLM Alignment ACL 2025

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction IJCNLP 2025

Continuously evolving rewards in an open-ended environment JMLR 2025

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs IJCNLP 2025

A Reinforcement Learning Framework for Cross-Lingual Stance Detection Using Chain-of-Thought Alignment ACL 2025

Structured Document Translation via Format Reinforcement Learning IJCNLP 2025

Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability JMLR 2025

Understanding Reference Policies in Direct Preference Optimization NAACL 2025

Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning ACL 2025

A Practical Analysis of Human Alignment with *PO NAACL 2025