Policy Learning

2068 directly classified papers

Papers

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness CVPR 2025

SeqMvRL: A Sequential Fusion Framework for Multi-view Representation Learning CVPR 2025

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping CVPR 2025

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories CVPR 2025

Incorporating Review-missing Interactions for Generative Explainable Recommendation COLING 2025

SATA: Safe and Adaptive Torque-Based Locomotion Policies Inspired by Animal Learning RSS 2025

Leveraging LLM-based sentiment analysis for portfolio optimization with proximal policy optimization ACL 2025

bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning ACL 2025

LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback ACL 2025

PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization ACL 2025

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving ACL 2025

Towards Medical Complex Reasoning with LLMs through Medical Verifiable Problems ACL 2025

CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization EMNLP 2025

DOGlove: Dexterous Manipulation with a Low-Cost Open-Source Haptic Force Feedback Glove RSS 2025

Offline Reinforcement Learning for LLM Multi-step Reasoning ACL 2025

Imitation Learning via Focused Satisficing IJCAI 2025

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration ACL 2025

Proactive Guidance of Multi-Turn Conversation in Industrial Search ACL 2025

T-REG: Preference Optimization with Token-Level Reward Regularization ACL 2025

Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models ACL 2025

SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation ACL 2025

YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering ACL 2025

Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models ACL 2025

Image Difference Captioning via Adversarial Preference Optimization EMNLP 2025

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents ACL 2025