EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang; Wen Wang; Binhui Xie; Quan Sun; Ledell Wu; Xinggang Wang; Tiejun Huang; Xinlong Wang; Yue Cao

2023 CVPR CVPR 2023

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVIS dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

📈 Trend Setter — Foundation Models

🧭 Keyword Pioneer — masked visual representation

🐣 Hot Topic Early Bird — vision foundation model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuxin Fang , Wen Wang , Binhui Xie , Quan Sun , Ledell Wu , Xinggang Wang , Tiejun Huang , Xinlong Wang , Yue Cao

Topics

Artificial Intelligence > Core AI > Foundation Models Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Architectures > Transformers Computer Vision > Analysis > Semantic Segmentation Computer Vision > Core AI Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Models > Foundation Models Computer Vision > Core AI > Foundation Models

Keywords

vision transformer transfer learning self-supervised learning multi-modal learning instance segmentation visual representation learning image recognition foundation model vision foundation model image-text alignment masked image modeling masked visual representation

Download PDF

Related papers

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching 2023

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars 2023

Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos 2023

Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement 2023

EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata 2023