InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen; Jiannan Wu; Wenhai Wang; Weijie Su; Guo Chen; Sen Xing; Muyan Zhong; Qinglong Zhang; Xizhou Zhu; Lewei Lu; Bin Li; Ping Luo; Tong Lu; Yu Qiao; Jifeng Dai

CVPR 2024

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval. It can also be linked with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to ViT-22B. We hope that our research can contribute to the development of multi-modal large models.

Topics

Artificial Intelligence > Core AI > Foundation Models; Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Models > Foundation Models; Artificial Intelligence > Core AI > Multi-Modal Learning; Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning; vision-language alignment; multi-modal learning; zero-shot image classification; foundation model; vision-language model; image-text retrieval; vision foundation model; large-scale training; multi-modal dialogue system
