ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Zirui Song; Guangxian Ouyang; Mingzhe Li; Yuheng Ji; Chenxi Wang; Zixiang Xu; Zeyu Zhang; Xiaoqing Zhang; Qian Jiang; Fengxian Ji; Zhenhao Chen; Zhongzhi Li; Xiuying Chen

2026 AAAI AAAI 2026

ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Abstract

Abstract Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions. Experimental results show that ManipLVM-R1 achieves substantial performance gains across multiple manipulation tasks, using only 50% of the training data while achieving strong generalization to OOD scenarios. We further analyze the benefits of our reward design and its impact on task success and efficiency.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zirui Song , Guangxian Ouyang , Mingzhe Li , Yuheng Ji , Chenxi Wang , Zixiang Xu , Zeyu Zhang , Xiaoqing Zhang , Qian Jiang , Fengxian Ji , Zhenhao Chen , Zhongzhi Li , Xiuying Chen

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning

Keywords

reinforcement learning zero-shot generalization large vision-language model verifiable reward embodied manipulation

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026