From Image to Video: An Empirical Study of Diffusion Representations

Pedro Vélez; Luisa F. Polanía; Yi Yang; Chuhan Zhang; Rishabh Kabra; Anurag Arnab; Mehdi S. M. Sajjadi

2025 ICCV ICCV 2025

From Image to Video: An Empirical Study of Diffusion Representations

Abstract

Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis.This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we analyze the performance of latent image and video diffusion representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. For the most informative comparison, we utilize the same model architecture, WALT, across image and video generation objectives. Our results show that video generation pre-training consistently outperforms its image counterpart, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Pedro Vélez , Luisa F. Polanía , Yi Yang , Chuhan Zhang , Rishabh Kabra , Anurag Arnab , Mehdi S. M. Sajjadi

Topics

Deep Learning > Models > Diffusion Models Deep Learning > Techniques > Pretraining Computer Vision > Analysis > Action Recognition Computer Vision > Core AI > Computer Vision Deep Learning > Learning Types > Representation Learning

Keywords

representation learning action recognition depth estimation video understanding pretraining strategy diffusion model visual understanding video representation

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025