On Scaling Up a Multilingual Vision and Language Model

Xi Chen; Josip Djolonga; Piotr Padlewski; Basil Mustafa; Soravit Changpinyo; Jialin Wu; Carlos Riquelme Ruiz; Sebastian Goodman; Xiao Wang; Yi Tay; Siamak Shakeri; Mostafa Dehghani; Daniel Salz; Mario Lucic; Michael Tschannen; Arsha Nagrani; Hexiang Hu; Mandar Joshi; Bo Pang; Ceslee Montgomery; Paulina Pietrzyk; Marvin Ritter; AJ Piergiovanni; Matthias Minderer; Filip Pavetic; Austin Waters; Gang Li; Ibrahim Alabdulmohsin; Lucas Beyer; Julien Amelot; Kenton Lee; Andreas Peter Steiner; Yang Li; Daniel Keysers; Anurag Arnab; Yuanzhong Xu; Keran Rong; Alexander Kolesnikov; Mojtaba Seyedhosseini; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut

CVPR 2024

Abstract

We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

👥 Mega-Team — 43 authors

Authors

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Image Captioning
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Few-Shot Learning
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Transfer Learning
Deep Learning > Models > Vision-Language Models

Keywords

few-shot learning, object detection, visual question answering, in-context learning, multimodal learning, image captioning, vision language model, vision-language model, model scaling, large language model, multilingual vision language model, multilingual vision and language
