Multimodal Autoregressive Pre-training of Large Vision Encoders

Enrico Fini; Mustafa Shukor; Xiujun Li; Philipp Dufter; Michal Klein; David Haldimann; Sai Aitharaju; Victor G. Turrisi da Costa; Louis Béthune; Zhe Gan; Alexander Toshev; Marcin Eichner; Moin Nabi; Yinfei Yang; Joshua Susskind; Alaaeldin El-Nouby

CVPR 2025

Abstract

We introduce a novel method for pre-training large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
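The objective described in the abstract, a decoder that autoregressively generates raw image patches and text tokens, can be illustrated with a minimal pure-Python sketch. Everything named here is hypothetical and not taken from the paper: the helpers `patchify`, `causal_mask`, and `multimodal_ar_loss`, the `image_weight` parameter, and the specific choice of squared error for patches and cross-entropy for text are assumptions, since the abstract does not state the loss functions. The encoder and decoder networks themselves are omitted; the sketch only shows how a next-element prediction loss over a mixed patch and token sequence might be assembled.

```python
import math

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened,
    non-overlapping patch vectors in raster order."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def patch_mse(pred, target):
    """Mean squared error between a predicted and a target patch vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def token_cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def multimodal_ar_loss(patch_preds, patches, text_logits, text_ids,
                       image_weight=1.0):
    """Assumed combined objective: text cross-entropy plus weighted patch MSE.

    patch_preds[i] is the (hypothetical) decoder output at patch i and is
    scored against the next patch, patches[i + 1].
    text_logits[i] is the decoder output that predicts text_ids[i].
    """
    img_terms = [patch_mse(patch_preds[i], patches[i + 1])
                 for i in range(len(patches) - 1)]
    txt_terms = [token_cross_entropy(text_logits[i], text_ids[i])
                 for i in range(len(text_ids))]
    img_loss = sum(img_terms) / len(img_terms)
    txt_loss = sum(txt_terms) / len(txt_terms)
    return txt_loss + image_weight * img_loss

# Toy usage: a 4x4 image cut into four 2x2 patches, followed by two text tokens.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
text_ids = [1, 3]                        # token ids in a toy vocabulary of size 4
mask = causal_mask(len(patches) + len(text_ids))
perfect_preds = patches[1:]              # stand-in for decoder patch outputs
uniform_logits = [[0.0] * 4 for _ in text_ids]
loss = multimodal_ar_loss(perfect_preds, patches, uniform_logits, text_ids)
```

In this toy setup the patch predictions are exact, so the loss reduces to the text term alone. In a real training loop both sets of predictions would come from the decoder operating on the vision encoder's features.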

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Deep Learning > Techniques > Pretraining
Deep Learning > Models > Transformers
Deep Learning > Models > Vision-Language Models

Keywords

contrastive learning, multimodal learning, foundation model, image understanding, vision encoder, autoregressive pre-training, multimodal decoder, frozen trunk
