SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

Hao Shi; Bin Xie; Yingfei Liu; Yang Yue; Tiancai Wang; Haoqiang Fan; Xiangyu Zhang; Gao Huang

2026 AAAI AAAI 2026

SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

Abstract

Abstract Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to inherent depth noise in real-world that disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometry from noisy depth and semantic-guided expert priors. Also, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — depth noise

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Hao Shi , Bin Xie , Yingfei Liu , Yang Yue , Tiancai Wang , Haoqiang Fan , Xiangyu Zhang , Gao Huang

Topics

Artificial Intelligence > Core AI > Agent Systems Machine Learning > Core Methods > Representation Learning Machine Learning > Learning Types > Few-Shot Learning

Keywords

robotic manipulation disentangled representation spatial transformer few-shot generalization depth noise

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026