Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning
Abstract
We present the Mobile-Oriented Video Diffusion (MOVD) framework, the first diffusion-based text-to-video generation framework designed for efficient on-device execution on smartphone-grade hardware without retraining, compressing, or pruning the target denoising model. To address the challenges of running diffusion-based text-to-video generation on computation- and memory-limited mobile devices, MOVD applies two novel techniques to pretrained video generative models. First, Linear Proportional Leap (LPL) reduces the large number of denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the heavy computational load of attention layers by merging consecutive tokens along the temporal dimension. By integrating these techniques with Concurrent Inference with Dynamic Loading (CI-DL), which splits large models into smaller, executable segments for memory-limited environments, MOVD allows a text-to-video diffusion model to run on an iPhone 15 Pro. We envision MOVD as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on mobile and embedded devices without resource-intensive optimization procedures.
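To make the idea behind TDTM concrete, the following is a minimal sketch of merging consecutive tokens along the temporal dimension. It is an illustration only, not the MOVD implementation: the merge rule (averaging each pair of adjacent frames), the restore rule (duplicating each merged frame), and the function names `merge_temporal_pairs` and `unmerge_temporal_pairs` are all assumptions introduced here.

```python
from typing import List

# One frame is a list of tokens; one token is a list of channel values.
Frame = List[List[float]]


def merge_temporal_pairs(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames token by token.

    Assumption: pairwise averaging stands in for whatever merge rule
    the actual method uses; the abstract does not specify it.
    """
    if len(frames) % 2 != 0:
        raise ValueError("this sketch assumes an even number of frames")
    merged = []
    for t in range(0, len(frames), 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    return merged


def unmerge_temporal_pairs(merged: List[Frame]) -> List[Frame]:
    """Restore the original frame count by duplicating each merged frame.

    Assumption: duplication is a hypothetical restore rule for illustration.
    """
    restored = []
    for frame in merged:
        restored.append([list(tok) for tok in frame])
        restored.append([list(tok) for tok in frame])
    return restored


if __name__ == "__main__":
    # Four frames, one token each, two channels per token.
    frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]], [[20.0, 30.0]]]
    merged = merge_temporal_pairs(frames)
    print(len(frames), "frames merged into", len(merged))
```

Under these assumptions, halving the number of frames halves the token count an attention layer sees. Because self-attention cost grows quadratically with the number of tokens, that layer's attention computation would drop by roughly a factor of four.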