DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes

Yang Liu; Feng Hou; Yunjie Peng; Gangjian Zhang; Yao Zhang; Dong Xie; Peng Wang; Yang Zhang; Jiang Tian; Zhongchao Shi; Jianping Fan; Zhiqiang He

2025 AAAI AAAI 2025

DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes

Abstract

Abstract Recent advances in vision-language pre-training have significantly enhanced the model capabilities on grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge different granular semantics and uses specific attributes to identify the object discrepancy. Our DoGA incorporates three principle components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations deriving from the image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training to accommodate model to perform many-to-many assignments, facilitating simultaneous training and inferring with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize multi-granular prompts and surpass previous state-of-the-art approaches, yielding 50.2 on the COCO and 38.0 on the LVIS benchmarks under the zero-short setting. We will make our code publicly available upon acceptance.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — attribute-based detection

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yang Liu , Feng Hou , Yunjie Peng , Gangjian Zhang , Yao Zhang , Dong Xie , Peng Wang , Yang Zhang , Jiang Tian , Zhongchao Shi , Jianping Fan , Zhiqiang He

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Analysis > Object Detection Computer Vision > Core AI > Computer Vision Deep Learning > Learning Types > Multi-Modal Learning

Keywords

zero-shot learning multi-modal learning vision-language pre-training zero-shot object detection grounded object detection attribute-based detection attribute-based learning

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025