SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning

Urwa Muaz; Wondong Jang; Rohun Tripathi; Santhosh Mani; Wenbin Ouyang; Ravi Teja Gadde; Baris Gecer; Sergio Elizondo; Reza Madad; Naveen Nair

ICCV 2023

Abstract

Dubbed video generation aims to accurately synchronize the mouth movements of a given facial video with driving audio while preserving identity and scene-specific visual dynamics, such as head pose and lighting. Previous approaches achieve accurate lip generation by adopting a pretrained audio-video synchronization metric, called Sync-Loss, as an objective function. Extending them to high-resolution videos has been challenging, however, because shift biases in the loss landscape inhibit tandem optimization of Sync-Loss and visual quality, leading to a loss of detail. To address this issue, we introduce shift-invariant learning, which generates photo-realistic high-resolution videos with accurate Lip-Sync. Further, we employ a pyramid network with coarse-to-fine image generation to improve stability and lip synchronization. Our model outperforms state-of-the-art methods on multiple benchmark datasets, including AVSpeech, HDTF, and LRW, in terms of photo-realism, identity preservation, and Lip-Sync accuracy.
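
The abstract does not spell out how shift bias is measured or removed. As a purely illustrative sketch, and not the paper's method, one generic way to probe whether a loss is sensitive to small spatial shifts is to evaluate it on jointly shifted copies of the same prediction and target. The toy loss below is a hypothetical stand-in for a pretrained objective such as Sync-Loss; the function names and the strided feature extractor are assumptions made for this example.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Circularly shift a 2D array by (dy, dx) pixels."""
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def toy_loss(pred, target, stride):
    """Hypothetical stand-in for a pretrained loss.

    Features are plain strided subsampling, so for stride > 1 the loss
    only looks at a subset of pixels that depends on image alignment.
    """
    diff = pred[::stride, ::stride] - target[::stride, ::stride]
    return float(np.mean(diff ** 2))

def shift_sensitivity(pred, target, stride, max_shift=1):
    """Spread (max - min) of the loss when both inputs are shifted together.

    A loss that is invariant to joint shifts yields a spread near zero.
    """
    losses = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            losses.append(
                toy_loss(shift_image(pred, dy, dx),
                         shift_image(target, dy, dx),
                         stride)
            )
    return max(losses) - min(losses)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((8, 8))
    target = rng.random((8, 8))
    print("stride 1 spread:", shift_sensitivity(pred, target, stride=1))
    print("stride 2 spread:", shift_sensitivity(pred, target, stride=2))
```

With `stride=1` the loss sees every pixel and the spread is zero up to floating-point rounding. With `stride=2` the value depends on which pixels happen to be sampled, which is one simple way a loss can carry a shift bias.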

Topics

Deep Learning > Models > Generative Models
Computer Vision > Generation > Video Generation

Keywords

video generation, identity preservation, audio-visual synchronization, lip synchronization, face generation, shift-invariant learning
