LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Shenghao Fu; Qize Yang; Qijie Mo; Junkai Yan; Xihan Wei; Jingke Meng; Xiaohua Xie; Wei-Shi Zheng

CVPR 2025

Abstract

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating image-level detailed captions for each image, can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by its grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and shows superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset will be available.
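
To make the training objective in the abstract concrete, the following is a minimal sketch of how a grounding loss and caption generation losses might be combined into a single scalar. It is an illustration under stated assumptions, not the authors' implementation: the function names, the averaging over regions, and the weights `w_image` and `w_region` are hypothetical, and the caption loss is assumed to be a token-level negative log-likelihood, which the abstract does not specify.

```python
import math


def caption_nll(token_log_probs):
    """Mean negative log-likelihood of a caption's reference tokens.

    `token_log_probs` holds the log-probability the language model assigns
    to each ground-truth token (assumed form of the caption loss).
    """
    if not token_log_probs:
        return 0.0
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(grounding_loss, image_caption_log_probs,
               region_caption_log_probs, w_image=1.0, w_region=1.0):
    """Combine the grounding loss with image- and region-level caption losses.

    `w_image` and `w_region` are hypothetical weights; the abstract only
    states that both kinds of caption are generated under LLM supervision.
    """
    image_loss = caption_nll(image_caption_log_probs)
    # Average the short-caption loss over all regions of interest.
    region_losses = [caption_nll(lp) for lp in region_caption_log_probs]
    region_loss = sum(region_losses) / len(region_losses) if region_losses else 0.0
    return grounding_loss + w_image * image_loss + w_region * region_loss


# Toy values: one long image-level caption and two short region-level captions.
loss = total_loss(
    grounding_loss=0.5,
    image_caption_log_probs=[math.log(0.5), math.log(0.5)],
    region_caption_log_probs=[[math.log(0.25)], [math.log(0.25)]],
)
```

In this sketch the detector would receive gradients from all three terms, which is one way to read "under the supervision of the large language model"; the actual weighting and loss form should be taken from the paper and released code.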

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Generalization
Computer Vision > Analysis > Object Detection
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

image captioning, multi-modal learning, image caption generation, large language model, open vocabulary object detection, large language model supervision, region level annotation, grounding model, open-vocabulary object detector, region grounding
