Transferring Foundation Models for Generalizable Robotic Manipulation

Jiange Yang; Wenhui Tan; Chuhao Jin; Keling Yao; Bei Liu; Jianlong Fu; Ruihua Song; Gangshan Wu; Limin Wang

WACV 2025

Abstract

Improving the generalization capabilities of general-purpose robotic manipulation in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data, which is costly and time-consuming. However, due to insufficient data diversity, they typically have limited capability in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, including new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner. Extensive real-world experiments conducted on a Franka Emika robot and a low-cost dual-arm robot demonstrate the effectiveness of our proposed paradigm and policy. Demos can be found in link1 or link2, and our code will be released at https://github.com/MCG-NJU/TPM.
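
The abstract does not spell out how the two streams consume the image and the mask, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "local-global perception" can be approximated by pairing a global view (the full image with the mask attached to each pixel) with a local view (a padded crop around the masked object). The function names `mask_bbox` and `two_stream_inputs` and the `pad` parameter are hypothetical, and plain nested lists stand in for image tensors.

```python
from typing import List, Tuple

Grid = List[List[int]]


def mask_bbox(mask: Grid) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of the nonzero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty: no object pixels to localize")
    return rows[0], cols[0], rows[-1], cols[-1]


def two_stream_inputs(image: Grid, mask: Grid, pad: int = 1):
    """Build hypothetical global and local inputs for a two-stream policy.

    Assumption (not from the paper): the global stream sees every pixel as an
    (intensity, mask) pair, and the local stream sees the same pairs cropped
    to the object's bounding box, enlarged by `pad` pixels and clipped to the
    image borders.
    """
    height, width = len(image), len(image[0])
    top, left, bottom, right = mask_bbox(mask)
    top, left = max(top - pad, 0), max(left - pad, 0)
    bottom, right = min(bottom + pad, height - 1), min(right + pad, width - 1)

    # Global view: full frame, with the mask attached as a second channel.
    global_view = [list(zip(img_row, mask_row))
                   for img_row, mask_row in zip(image, mask)]
    # Local view: the padded crop around the masked object.
    local_view = [row[left:right + 1] for row in global_view[top:bottom + 1]]
    return global_view, local_view


# Toy 5x6 "image" with a 2x2 object mask at rows 2-3, columns 3-4.
image = [[r * 6 + c for c in range(6)] for r in range(5)]
mask = [[1 if 2 <= r <= 3 and 3 <= c <= 4 else 0 for c in range(6)]
        for r in range(5)]
global_view, local_view = two_stream_inputs(image, mask)
```

In a learned policy, each view would be fed to its own encoder and the two feature vectors fused before the action head; those components are omitted here because the abstract gives no detail about them.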

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Robotics
Deep Learning > Models > Foundation Models

Keywords

semantic segmentation, imitation learning, robotic manipulation, transfer learning, vision language model, foundation model, language reasoning
