2025 ICCV ICCV 2025

Learning Beyond Still Frames: Scaling Vision-Language Models with Video

Abstract

High-quality image-text data is critical in enhancing Vision-Language Models (VLMs), but traditional image-based pretraining approaches face limitations. These methods are resource-intensive, relying on curated, high-quality interleaved data that is costly and challenging to collect at scale. Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model's ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. Therefore, we propose Causal Hierarchical Aggregation to separate computation-heavy spatial encoding from lightweight temporal propagation and construct hierarchical receptive fields at varying granularities. As we scale video context to more than 100 billion tokens, our method excels in high throughput and state-of-the-art performances on both Image and Video understanding, as shown in Figure 1, providing a scalable solution to enhance multimodal learning in dynamic contexts.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio