SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li; Renshan Zhang; Rui Shao; Zhijian Fang; Kaiwen Zhou; Zhuotao Tian; Liqiang Nie

2026 AAAI AAAI 2026

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Abstract

Abstract Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: **1) perceptual redundancy**, where irrelevant visual inputs are processed inefficiently, and **2) superficial instruction-vision alignment**, which hampers semantic grounding of actions. In this paper, we propose **SemanticVLA**, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: **1)** To sparsify redundant perception while preserving semantic alignment, **Semantic-guided Dual Visual Pruner (SD-Pruner)** performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. **2)** To exploit sparsified features and integrate semantics with spatial geometry, **Semantic-complementary Hierarchical Fuser (SH-Fuser)** fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. **3)** To enhance the transformation from perception to action, **Semantic-conditioned Action Coupler (SA-Coupler)** replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by **21.1%** in success rate, while reducing training cost and inference latency by **3.0×** and **2.7×**.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — perceptual redundancy

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Wei Li , Renshan Zhang , Rui Shao , Zhijian Fang , Kaiwen Zhou , Zhuotao Tian , Liqiang Nie

Topics

Machine Learning > Application Areas > Domain Adaptation Machine Learning > Application Areas > Efficient Computing Deep Learning > Techniques > Model Architecture

Keywords

semantic alignment feature fusion vision language action perceptual redundancy

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026