Look Around Before Locating: Considering Content and Structure Information for Visual Grounding

Shiyi Zheng; Peizhi Zhao; Zhilong Zheng; Peihang He; Haonan Cheng; Yi Cai; Qingbao Huang

2025 AAAI AAAI 2025

Look Around Before Locating: Considering Content and Structure Information for Visual Grounding

Abstract

Abstract As a long-term challenge and fundamental requirement in vision and language tasks, visual grounding aims to localize a target referred by a natural language query. The regional annotations form a superficial correlation between the subject of expression and some common visual entities, which hinder models from comprehending the linguistic content and structure. However, current one-stage methods struggle to uniformly model the visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding to gradually comprehend the linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to obtain modulation grounding by linguistic structure. Through a soft split mechanism, our method can destructure the expression into a fixed semi-structure (i.e., subject and context) while ensuring the completeness of linguistic content. Our method is thus capable of building a semi-structured reasoning system to effectively comprehend the linguistic content and structure by content alignment and structure modulated grounding. Experimental results on five widely-used datasets validate the performance improvements of our proposed method.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — semi-structured reasoning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shiyi Zheng , Peizhi Zhao , Zhilong Zheng , Peihang He , Haonan Cheng , Yi Cai , Qingbao Huang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Computer Vision > Analysis > Object Detection Computer Vision > Analysis > Scene Understanding Computer Vision > Core AI > Computer Vision Artificial Intelligence > Core AI > Language Deep Learning > Learning Types > Multi-Modal Learning

Keywords

scene understanding cross-modal learning visual grounding semantic space cross-modal alignment linguistic structure vision language semi-structured reasoning

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025