Mobile Video Diffusion

Haitam Ben Yahia; Denis Korzhenkov; Ioannis Lelekas; Amir Ghodrati; Amirhossein Habibian

ICCV 2025

Abstract

Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized image-to-video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we cut the computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, can generate latents for a 14 × 512 × 256 px clip in 1.7 seconds on a Xiaomi-14 Pro with negligible quality loss. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion
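To give a sense of what "generating latents" for a 14-frame 512 x 256 clip involves, the sketch below estimates the size of the latent video tensor and compares it with a higher-resolution setting. It is a back-of-the-envelope illustration, not the authors' code. The 8x spatial downsampling factor, the 4 latent channels, and the 1024 x 576 comparison resolution are assumptions based on the commonly used SVD configuration and are not stated in the abstract. The helper name `latent_shape` is hypothetical.

```python
from math import prod


def latent_shape(frames, height, width, vae_factor=8, latent_channels=4):
    """Return the (frames, channels, h, w) shape of a latent video tensor.

    vae_factor and latent_channels are assumed values (typical for the
    autoencoder used with SVD); they do not come from the abstract.
    """
    if height % vae_factor or width % vae_factor:
        raise ValueError("height and width must be divisible by vae_factor")
    return (frames, latent_channels, height // vae_factor, width // vae_factor)


# Clip size quoted in the abstract: 14 frames at 512 x 256 pixels.
mobile = latent_shape(frames=14, height=256, width=512)

# Assumed higher-resolution reference point: 14 frames at 1024 x 576 pixels.
reference = latent_shape(frames=14, height=576, width=1024)

# Ratio of latent elements the denoiser has to process per step.
ratio = prod(reference) / prod(mobile)

print("mobile latent shape:   ", mobile)
print("reference latent shape:", reference)
print("element ratio:         ", ratio)
```

Under these assumptions, lowering the resolution alone shrinks the latent tensor by a constant factor. The remaining savings described in the abstract come from pruning and from reducing the denoising to a single step, which this sketch does not model.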

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Deep Learning > Optimization & Theory > Model Compression

Keywords

model compression, video generation, efficient computing, model pruning, adversarial finetuning, video diffusion, mobile deployment, video diffusion model, channel pruning, latent generation, mobile generation, temporal block pruning
