Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization

Jonathan Heewon Yang; Chuyuan Fu; Dhruv Shah; Dorsa Sadigh; Fei Xia; Tingnan Zhang

RSS 2025

Abstract

In this work, we investigate how spatially-grounded auxiliary representations can provide both broad, high-level grounding and direct, actionable information that improves policy learning performance and generalization for dexterous tasks. We study these mid-level representations along three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then use these representations as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve the generalization of the policy. On our evaluation tasks, this method achieves, on average, an 11% higher success rate than a language-grounded baseline and a 24% higher success rate than a standard diffusion policy baseline. Furthermore, we find that leveraging mid-level representations as supervision signals for policy actions within a weighted imitation learning algorithm improves the precision with which the policy follows these representations, leading to an additional performance increase of 10%. Our findings highlight the importance of grounding robot policies not only in broad perceptual tasks but also in more granular, actionable representations.
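To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a soft gating scheme in which each expert produces a feature vector and a scalar gate logit, and a generic weighted behavior-cloning loss in which each demonstration sample carries a scalar weight. The function names `fuse_experts` and `weighted_bc_loss`, the softmax gate, and the squared-error loss are all illustrative assumptions; the abstract does not specify how experts are combined or how the imitation weights are computed.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(
    expert_features: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> List[float]:
    """Combine per-expert feature vectors with softmax gate weights.

    Assumption: one feature vector per expert (e.g. object-centric,
    pose-aware, depth-aware), all of the same dimension.
    """
    if len(expert_features) != len(gate_logits):
        raise ValueError("need exactly one gate logit per expert")
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, expert_features):
        if len(feat) != dim:
            raise ValueError("expert features must share one dimension")
        for i, value in enumerate(feat):
            fused[i] += w * value
    return fused


def weighted_bc_loss(
    pred_actions: Sequence[Sequence[float]],
    demo_actions: Sequence[Sequence[float]],
    sample_weights: Sequence[float],
) -> float:
    """Weighted mean of per-sample squared action errors.

    Assumption: the weights are given; how they would be derived from
    mid-level representations is not described in the abstract.
    """
    total_weight = sum(sample_weights)
    if total_weight <= 0:
        raise ValueError("sample weights must sum to a positive value")
    weighted_error = 0.0
    for pred, demo, w in zip(pred_actions, demo_actions, sample_weights):
        sq_err = sum((p - d) ** 2 for p, d in zip(pred, demo))
        weighted_error += w * sq_err
    return weighted_error / total_weight
```

In this sketch, the fused vector would stand in for the conditioning input of a downstream policy, and the weighted loss would emphasize demonstration samples with larger weights. Both are placeholders for whatever the paper actually uses.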

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Reinforcement Learning > Applications > Robotics

Keywords

representation learning; imitation learning; mixture of experts; diffusion policy; mid-level representation; bimanual manipulation
