SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
Abstract
Abstract Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: **1) perceptual redundancy**, where irrelevant visual inputs are processed inefficiently, and **2) superficial instruction-vision alignment**, which hampers semantic grounding of actions. In this paper, we propose **SemanticVLA**, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: **1)** To sparsify redundant perception while preserving semantic alignment, **Semantic-guided Dual Visual Pruner (SD-Pruner)** performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. **2)** To exploit sparsified features and integrate semantics with spatial geometry, **Semantic-complementary Hierarchical Fuser (SH-Fuser)** fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. **3)** To enhance the transformation from perception to action, **Semantic-conditioned Action Coupler (SA-Coupler)** replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by **21.1%** in success rate, while reducing training cost and inference latency by **3.0×** and **2.7×**.