Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Songwei Ge; Seungjun Nah; Guilin Liu; Tyler Poon; Andrew Tao; Bryan Catanzaro; David Jacobs; Jia-Bin Huang; Ming-Yu Liu; Yogesh Balaji

ICCV 2023

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data at a similar scale is still challenging. Training a video diffusion model is also computationally much more expensive than training its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance, whereas our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own COrrelation (PYoCo), attains state-of-the-art zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves state-of-the-art video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model, using significantly less computation than the prior art.
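To make the distinction between a naive and a correlated video noise prior concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the form of the prior, so this sketch assumes one plausible design: each frame's noise is the sum of a component shared by all frames and a per-frame component. The function names and the mixing parameter `alpha` are hypothetical, introduced only for illustration.

```python
import math
import random


def iid_frame_noise(num_frames, dim, rng):
    """Naive extension of an image prior: independent N(0, 1) noise per frame."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


def mixed_frame_noise(num_frames, dim, alpha, rng):
    """Assumed correlated prior: one shared draw plus an individual draw per frame.

    The two variances, alpha^2 / (1 + alpha^2) and 1 / (1 + alpha^2), sum to 1,
    so each frame is still marginally N(0, 1), while any two frames have
    correlation alpha^2 / (1 + alpha^2).
    """
    shared_std = math.sqrt(alpha ** 2 / (1.0 + alpha ** 2))
    indiv_std = math.sqrt(1.0 / (1.0 + alpha ** 2))
    shared = [rng.gauss(0.0, shared_std) for _ in range(dim)]
    return [
        [s + rng.gauss(0.0, indiv_std) for s in shared]
        for _ in range(num_frames)
    ]


def variance(x):
    """Population variance of a flat list of samples."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def correlation(x, y):
    """Pearson correlation between two equally long lists of samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    return cov / math.sqrt(variance(x) * variance(y))


if __name__ == "__main__":
    rng = random.Random(0)
    naive = iid_frame_noise(num_frames=2, dim=20000, rng=rng)
    mixed = mixed_frame_noise(num_frames=2, dim=20000, alpha=1.0, rng=rng)
    print("i.i.d. cross-frame correlation:", correlation(naive[0], naive[1]))
    print("mixed cross-frame correlation: ", correlation(mixed[0], mixed[1]))
```

Under this assumed form, `alpha = 0` recovers the naive i.i.d. prior, and larger values of `alpha` make the noise of different frames more alike while leaving each frame's marginal distribution unchanged.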

Topics

Deep Learning > Models > Diffusion Models; Computer Vision > Generation > Video Generation

Keywords

video generation; video diffusion; temporal coherence; noise prior
