Reinforcement Learning

2932 directly classified papers

Papers

Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences (ACL 2025)

InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating (ACL 2025)

Structured Document Translation via Format Reinforcement Learning (IJCNLP 2025)

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training (ICCV 2025)

HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks (ACL 2025)

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness (ICCV 2025)

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation (EMNLP 2025)

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models (ACL 2025)

Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning (ACL 2025)

CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback (ACL 2025)

CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling (ACL 2025)

CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis (ACL 2025)

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction (IJCNLP 2025)

Visual-RFT: Visual Reinforcement Fine-Tuning (ICCV 2025)

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization (ACL 2025)

Enhancing Machine Translation with Self-Supervised Preference Data (ACL 2025)

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (IJCNLP 2025)

Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment (ICCV 2025)

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement (ACL 2025)

KERLQA: Knowledge-Enhanced Reinforcement Learning for Question Answering in Low-resource Languages (IJCNLP 2025)

On the Convergence of Moral Self-Correction in Large Language Models (IJCNLP 2025)

Atomic Consistency Preference Optimization for Long-Form Question Answering (IJCNLP 2025)

Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning (NAACL 2025)

Understanding Reference Policies in Direct Preference Optimization (NAACL 2025)

Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences (ICCV 2025)