2024 CVPR CVPR 2024

On Scaling Up a Multilingual Vision and Language Model

Abstract

We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks including multiple image-based captioning and question-answering tasks image-based document understanding and few-shot (in-context) learning as well as object detection video question answering and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally we observe emerging capabilities such as complex counting and multilingual object detection tasks that are not explicitly in the training mix.

👥 Mega-Team — 43 authors
🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Natural Language Processing
🧭 Keyword Pioneer — multilingual vision language model
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio