IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation

Qi Chen; Changli Wu; Jiayi Ji; Yiwei Ma; Danni Yang; Xiaoshuai Sun

AAAI 2025

Abstract

3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, which leaves it without top-down, task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state of the art by 1.9 and 4.2 mIoU points on the 3D-RES and 3D-GRES tasks, respectively.
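The reported gains are in mIoU, the mean intersection-over-union between predicted and ground-truth masks. As a reminder of what that metric measures, here is a minimal sketch, assuming each prediction and each ground truth is a per-point binary mask over the same point cloud. The function names `mask_iou` and `mean_iou` are illustrative and not taken from the paper's code, and the handling of empty masks is an assumed convention; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """IoU between two per-point binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must cover the same set of points")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: if both masks are empty, count it as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Sequence[bool]],
             targets: Sequence[Sequence[bool]]) -> float:
    """Average the per-expression IoU over all (prediction, ground truth) pairs."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need the same non-zero number of predictions and targets")
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


# Two toy expressions over a 4-point scene (hypothetical data).
preds = [[True, True, False, False], [False, True, True, False]]
targets = [[True, False, False, False], [False, True, True, False]]
score = mean_iou(preds, targets)
```

In this toy case the first mask overlaps its target on one of two predicted points (IoU 0.5) and the second matches exactly (IoU 1.0), so `score` is 0.75. A 1.9-point gain corresponds to a 0.019 increase in this average when it is expressed on a 0 to 1 scale.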

Topics

Computer Vision > Analysis > 3D Vision; Computer Vision > Processing > Image Segmentation; Computer Vision > Processing > Semantic Segmentation; Artificial Intelligence > Core AI > Computer Vision; Computer Vision > Analysis > Object Segmentation

Keywords

semantic segmentation; image segmentation; multimodal learning; point cloud; 3D vision; referring expression; multi-view learning; 3D segmentation; multi-view image; point cloud processing; referring expression segmentation; 3D referring expression; prompt decoding
