FLAVA: A Foundational Language and Vision Alignment Model

Amanpreet Singh; Ronghang Hu; Vedanuj Goswami; Guillaume Couairon; Wojciech Galuba; Marcus Rohrbach; Douwe Kiela

CVPR 2022

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to perform well on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Deep Learning > Architectures > Transformers

Deep Learning > Techniques > Pretraining

Deep Learning > Models > Foundation Models

Deep Learning > Learning Types > Multi-Modal Learning

Deep Learning > Learning Types > Transfer Learning

Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

transformer architecture, transfer learning, multimodal learning, cross-modal learning, multi-modal learning, vision language model, cross-modal alignment, foundation model, vision language alignment, foundational model
