Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai; Xinyang Geng; Karttikeya Mangalam; Amir Bar; Alan L. Yuille; Trevor Darrell; Jitendra Malik; Alexei A. Efros

CVPR 2024

Abstract

We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
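The training objective named in the abstract, cross-entropy for next-token prediction over a tokenized sequence, can be illustrated with a minimal sketch. The abstract does not describe the tokenizer, vocabulary, or model architecture, so everything below is an assumption for illustration: `next_token_cross_entropy` is a hypothetical helper, the token ids stand in for a tokenized "visual sentence", and the logits are hand-written placeholders for what a model would produce.

```python
import math

def next_token_cross_entropy(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: list of length len(tokens) - 1; logits[t] holds unnormalized
            scores over the vocabulary, conditioned on tokens[: t + 1].
    tokens: list of integer token ids (a stand-in for a "visual sentence").
    """
    if len(logits) != len(tokens) - 1:
        raise ValueError("need exactly one logit vector per predicted token")
    total = 0.0
    for scores, target in zip(logits, tokens[1:]):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the true next token.
        total += log_norm - scores[target]
    return total / len(logits)

# Toy setup (assumed): vocabulary of 4 tokens, sequence of 3 token ids.
tokens = [2, 0, 3]

# A model that is uniform over the vocabulary at every position.
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
uniform_loss = next_token_cross_entropy(uniform_logits, tokens)

# A model that puts most of its mass on the correct next token.
confident_logits = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 10.0]]
confident_loss = next_token_cross_entropy(confident_logits, tokens)
```

With uniform scores the loss equals the log of the vocabulary size, and it falls toward zero as the scores concentrate on the correct next token. In practice this loss would be computed by a deep learning framework over batches of sequences; the pure-Python version here only makes the arithmetic explicit.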

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining

Keywords

representation learning, sequential modeling, cross-entropy loss, large vision model, next token prediction, visual sentence
