RegionGPT: Towards Region Understanding Vision Language Model

Qiushan Guo; Shalini De Mello; Hongxu Yin; Wonmin Byeon; Ka Chun Cheung; Yizhou Yu; Ping Luo; Sifei Liu

CVPR 2024

Abstract

Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both the training and inference phases, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied to, and significantly enhances performance across, a range of region-level tasks, including but not limited to complex region descriptions, reasoning, object classification, and referring expression comprehension.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models
Deep Learning > Learning Types > Instruction Tuning

Keywords

multimodal learning, referring expression, visual reasoning, instruction tuning, vision-language model, spatial awareness, large language model, region understanding, region captioning
