RefDetector: A Simple Yet Effective Matching-based Method for Referring Expression Comprehension

Yabing Wang; Zhuotao Tian; Zheng Qin; Sanping Zhou; Le Wang

AAAI 2025

Abstract

Despite rapid and substantial advances, object detection remains limited by pre-defined category sets. Current visual grounding methods focus primarily on using the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. In contrast, some earlier methods, i.e., matching-based methods, build upon existing object detectors and extend them to localize an object from a free-form linguistic expression, which gives them good application potential. However, the potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations of current matching-based methods (i.e., the mismatch problem and complicated fusion mechanisms), and then present a simple yet effective matching-based method, namely RefDetector. To address the mismatch problem, we devise a simple heuristic rule that generates proposals with improved referent recall. To address the fusion problem, we introduce a straightforward vision-language interaction module that eliminates the need for intricate, manually designed mechanisms. Moreover, we explore visual grounding built on the modern detector DETR and achieve significant performance improvements. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.
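To make the matching-based idea concrete, the following is a minimal sketch of the generic pipeline step the abstract refers to: a detector supplies candidate boxes with region features, and the box whose feature best matches the expression embedding is returned as the referent. It is not RefDetector's proposal rule or interaction module. The function names `cosine` and `select_referent`, the toy feature vectors, and the choice of cosine similarity as the matching score are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_referent(
    boxes: List[Box],
    region_feats: List[Sequence[float]],
    text_feat: Sequence[float],
) -> Tuple[Box, List[float]]:
    """Score every proposal against the expression and return the best box.

    Assumption: region and text features already live in a shared embedding
    space; a real system would learn that alignment.
    """
    if not boxes or len(boxes) != len(region_feats):
        raise ValueError("need one feature vector per proposal box")
    scores = [cosine(feat, text_feat) for feat in region_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best], scores


# Toy data: two proposals from a hypothetical detector and one expression embedding.
boxes = [(10.0, 20.0, 110.0, 220.0), (150.0, 30.0, 300.0, 240.0)]
region_feats = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]
text_feat = [0.0, 1.0, 0.2]

best_box, scores = select_referent(boxes, region_feats, text_feat)
print(best_box, [round(s, 3) for s in scores])
```

The sketch also shows why proposal recall matters in this family of methods: if the referent is absent from `boxes`, no matching score can recover it, which is the mismatch problem the abstract describes.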

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Analysis > Object Detection; Computer Vision > Core AI > Multimodal Learning; Natural Language Processing > Applications > Visual Question Answering; Artificial Intelligence > Core AI > Language; Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

object detection; referring expression; visual grounding; vision-language model; referring expression comprehension; matching-based method; vision-language interaction
