Structure and Content-Guided Video Synthesis with Diffusion Models

Patrick Esser; Johnathan Chiu; Parmida Atighehchian; Jonathan Granskog; Anastasis Germanidis

2023 ICCV ICCV 2023

Structure and Content-Guided Video Synthesis with Diffusion Models

Abstract

Text-guided generative diffusion models unlock powerful image creation and editing tools. Recent approaches that edit the content of footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. A novel guidance method, enabled by joint video and image training, exposes explicit control over temporal consistency. Our experiments demonstrate a wide variety of successes; fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐣 Hot Topic Early Bird — video synthesis

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Patrick Esser , Johnathan Chiu , Parmida Atighehchian , Jonathan Granskog , Anastasis Germanidis

Topics

Deep Learning > Models > Diffusion Models Computer Vision > Generation > Video Generation Deep Learning > Learning Types > Multimodal Learning

Keywords

video synthesis depth estimation diffusion model video editing temporal consistency text-guided generation

Download PDF

Related papers

PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework 2023

Periodically Exchange Teacher-Student for Source-Free Object Detection 2023

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations 2023

Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles 2023

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation 2023