Reinforcement Learning

2932 directly classified papers

Papers

DeMAC: Enhancing Multi-Agent Coordination with Dynamic DAG and Manager-Player Feedback EMNLP 2025

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving ACL 2025

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning EMNLP 2025

CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation ACL 2025

Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points ACL 2025

Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision EMNLP 2025

MiniELM: A Lightweight and Adaptive Query Rewriting Framework for E-Commerce Search Optimization ACL 2025

Breaking the Reasoning Barrier: A Survey on LLM Complex Reasoning through the Lens of Self-Evolution ACL 2025

LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information ACL 2025

On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation ACL 2025

To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization ACL 2025

Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively ACL 2025

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond ACL 2025

Proactive Guidance of Multi-Turn Conversation in Industrial Search ACL 2025

One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL ACL 2025

BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors ACL 2025

Henry at BEA 2025 Shared Task: Improving AI Tutor’s Guidance Evaluation Through Context-Aware Distillation ACL 2025

Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment EMNLP 2025

Steering LLM Reasoning Through Bias-Only Adaptation EMNLP 2025

RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution EMNLP 2025

Identification of Multiple Logical Interpretations in Counter-Arguments EMNLP 2025

CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback ACL 2025

T-REG: Preference Optimization with Token-Level Reward Regularization ACL 2025

Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory ACL 2025

Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective ACL 2025