CVPR 2024
On Scaling Up a Multilingual Vision and Language Model
Abstract
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Authors
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
Topics
Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Image Captioning
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Few-Shot Learning
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Transfer Learning
Deep Learning > Models > Vision-Language Models