An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran; Ilija Radosavovic; Rahul Ravishankar; Yossi Gandelsman; Christoph Feichtenhofer; Jitendra Malik

ICCV 2025

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate.
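The training objective described in the abstract (treat a video as one sequence of visual tokens and predict each token from the ones before it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy token ids, the vocabulary size of 8, and the `uniform_model` placeholder are all assumptions made for illustration. A real system would use a learned visual tokenizer and a transformer in place of the placeholder.

```python
import math
from typing import Callable, List, Sequence, Tuple


def flatten_video(frames: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate per-frame token ids into one sequence (frame order, then token order)."""
    return [tok for frame in frames for tok in frame]


def next_token_pairs(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Shift by one: the input at position i is paired with the token at position i + 1."""
    return list(tokens[:-1]), list(tokens[1:])


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def uniform_model(prefix: Sequence[int], vocab_size: int) -> List[float]:
    """Hypothetical stand-in for a transformer: ignores the prefix, returns equal logits."""
    return [0.0] * vocab_size


def autoregressive_loss(
    tokens: Sequence[int],
    vocab_size: int,
    model: Callable[[Sequence[int], int], List[float]] = uniform_model,
) -> float:
    """Average next-token cross-entropy over the sequence."""
    inputs, targets = next_token_pairs(tokens)
    losses = []
    for i, target in enumerate(targets):
        # The model only sees tokens up to and including position i (causal prefix).
        logits = model(inputs[: i + 1], vocab_size)
        losses.append(cross_entropy(logits, target))
    return sum(losses) / len(losses)


frames = [[3, 1, 4], [1, 5, 2]]  # two toy frames, three made-up token ids each
tokens = flatten_video(frames)
loss = autoregressive_loss(tokens, vocab_size=8)
```

With the uniform placeholder, every token is equally likely, so the loss equals the log of the vocabulary size. A trained model would be expected to score below that baseline.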

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > 3D Vision
Machine Learning > Learning Types > Representation Learning
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Representation Learning

Keywords

representation learning; autoregressive model; scaling law; visual token; video model; video representation learning; visual tokenization; autoregressive pre-training

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025