Learning Beyond Still Frames: Scaling Vision-Language Models with Video

Yiyuan Zhang; Handong Li; Jing Liu; Xiangyu Yue

ICCV 2025

Abstract

High-quality image-text data is critical for enhancing Vision-Language Models (VLMs), but traditional image-based pretraining approaches face two limitations. First, they are resource-intensive, relying on curated, high-quality interleaved data that is costly and difficult to collect at scale. Second, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model's ability to capture temporal dynamics and its general visual perception. Doing so requires reconciling spatial redundancy with strict temporal causality. We therefore propose Causal Hierarchical Aggregation, which separates computation-heavy spatial encoding from lightweight temporal propagation and constructs hierarchical receptive fields at varying granularities. As we scale video context to more than 100 billion tokens, our method achieves high throughput and state-of-the-art performance on both image and video understanding, as shown in Figure 1, providing a scalable solution for enhancing multimodal learning in dynamic contexts.
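The separation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the mean-pooling "encoder", the window sizes, and every function name below are assumptions chosen only to show the general idea of encoding each frame once (the expensive spatial step) and then combining frame tokens over past-only windows of several lengths (a cheap, causal temporal step with receptive fields at multiple granularities).

```python
from typing import List, Sequence

Vector = List[float]


def encode_frame(frame: List[Vector]) -> Vector:
    """Hypothetical stand-in for a heavy spatial encoder.

    Mean-pools the patch features of a single frame into one frame token.
    """
    n_patches = len(frame)
    dim = len(frame[0])
    return [sum(patch[d] for patch in frame) / n_patches for d in range(dim)]


def causal_window_mean(tokens: List[Vector], t: int, window: int) -> Vector:
    """Average the frame tokens in [t - window + 1, t].

    Only frames at or before t are read, so the result is causal.
    """
    start = max(0, t - window + 1)
    span = tokens[start:t + 1]
    dim = len(tokens[0])
    return [sum(v[d] for v in span) / len(span) for d in range(dim)]


def causal_hierarchical_aggregate(
    video: List[List[Vector]], windows: Sequence[int] = (1, 2, 4)
) -> List[Vector]:
    """Toy causal aggregation over several temporal granularities.

    Each frame is encoded exactly once; the per-step temporal work is a
    set of cheap averages over past frame tokens, one per window size,
    concatenated into a single feature vector for that time step.
    """
    tokens = [encode_frame(frame) for frame in video]  # spatial step, once per frame
    outputs = []
    for t in range(len(tokens)):
        features: Vector = []
        for window in windows:  # short to long receptive fields
            features.extend(causal_window_mean(tokens, t, window))
        outputs.append(features)
    return outputs
```

In this sketch, changing a later frame leaves the outputs for all earlier time steps unchanged, which is the causality property the abstract refers to. A real model would replace the averages with learned layers.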

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Optimization & Theory > Optimization
Computer Vision > Processing > Video Understanding
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Models > Foundation Models
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

temporal dynamics, multimodal learning, video understanding, vision-language model, foundation model, video pretraining, causal aggregation
