VLP: Vision-Language Preference Learning for Embodied Manipulation

Runze Liu; Chenjia Bai; Jiafei Lyu; Shengjie Sun; Yali Du; Xiu Li

EMNLP 2025

Abstract

Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel Vision-Language Preference learning framework, named VLP, which learns a vision-language preference model to provide feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders. The model learns to extract language-related features, and then serves as a predictor in various downstream tasks. The policy can be learned according to the annotated labels via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin and shifting the burden from continuous, per-task human annotation to one-time, per-domain data collection.
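
To illustrate how a policy's reward function can be learned from annotated preference labels, the following is a minimal sketch of the Bradley-Terry formulation that is standard in preference-based RL generally. The abstract does not specify VLP's exact objective, so this is an assumption rather than the authors' implementation. The function names and the per-step reward lists are hypothetical.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Each argument is a list of per-step predicted rewards for one trajectory
    segment (hypothetical inputs; a real system would get them from a
    learned reward model).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and a label.

    label = 1.0 if A is preferred, 0.0 if B is preferred,
    and 0.5 if the two segments are judged equally preferable.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Minimizing this loss over many labeled segment pairs pushes the reward model to assign a higher total reward to the preferred segment. In a setup like the one the abstract describes, the labels would come from the preference model instead of a human annotator.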

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Reinforcement Learning
Artificial Intelligence > Core AI > Robotics
Reinforcement Learning > Methods > Policy Learning
Reinforcement Learning > Applications > Robotics

Keywords

reinforcement learning; preference learning; reward learning; human feedback; vision-language model; preference-based reinforcement learning; vision-language preference learning; embodied manipulation; language-conditioned preference
