ParGo: Bridging Vision-Language with Partial and Global Views

An-Lan Wang; Bin Shan; Wei Shi; Kun-Yu Lin; Xiang Fei; Guozhi Tang; Lei Liao; Jingqun Tang; Can Huang; Wei-Shi Zheng

AAAI 2025

Abstract

This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities in Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo and highlight its superiority in aligning the vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly on tasks that emphasize detail perception.
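One way to picture the contrast between a purely global projector and a partial-global one is masked cross-attention over image tokens: some queries may attend to every token (global view), while others are restricted to a subset (partial view). The sketch below is only an illustration under that assumption, not the authors' implementation. The function names, the 1-D contiguous chunking of tokens, and all sizes are hypothetical choices made for the example.

```python
import numpy as np

def build_partial_global_mask(num_partial, num_global, num_tokens):
    """Boolean mask of shape (num_partial + num_global, num_tokens).

    True means the query may attend to that image token.
    Assumption: each partial query sees one contiguous chunk of tokens.
    """
    assert 1 <= num_partial <= num_tokens, "each partial query needs at least one token"
    mask = np.zeros((num_partial + num_global, num_tokens), dtype=bool)
    # Split token indices into near-equal contiguous chunks, one per partial query.
    chunks = np.array_split(np.arange(num_tokens), num_partial)
    for q, idx in enumerate(chunks):
        mask[q, idx] = True
    # Global queries may attend to every token.
    mask[num_partial:, :] = True
    return mask

def masked_cross_attention(queries, tokens, mask):
    """Single-head cross-attention where disallowed positions get zero weight."""
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights

# Toy example: 16 image tokens of width 8, 4 partial queries and 2 global queries.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
queries = rng.normal(size=(6, 8))
mask = build_partial_global_mask(num_partial=4, num_global=2, num_tokens=16)
projected, weights = masked_cross_attention(queries, tokens, mask)
```

In this toy setup the first four output rows each summarize only four tokens, while the last two summarize all sixteen, so a few salient tokens cannot dominate every output vector.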

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Keywords

image captioning, vision-language alignment, multimodal large language model, vision encoder, partial-global view, detail perception
