FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

Chao Xu; Yang Liu; Jiazheng Xing; Weida Wang; Mingze Sun; Jun Dan; Tianxin Huang; Siyuan Li; Zhi-Qi Cheng; Ying Tai; Baigui Sun

CVPR 2024

Abstract

In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity, diverse talking face generation from a single audio input. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. To tackle these issues, we first dig out the intricate relationships among facial factors and simplify the decoupling process, tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning, where each stage incorporates a customized training module responsible for a specific factor. Second, to achieve visually diverse and audio-synchronized animation solely from input audio within a single model, we introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and semantics, as well as texture and temporal coherence between frames. In this way, we inherit high-quality, diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The code will be released at https://github.com/modelscope/facechain.
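
The idea of attaching trainable adapters to a frozen generator can be pictured with a toy sketch. This is not the authors' implementation: the backbone here is a single fixed scalar weight rather than a Latent Diffusion Model, and the adapter labels ("geometry", "texture", "temporal"), the class names, and the training loop are all hypothetical, chosen only to show that gradient updates touch the adapters while the backbone stays unchanged.

```python
from dataclasses import dataclass


@dataclass
class FrozenBackbone:
    """Toy stand-in for a pretrained generator; its weight is never updated."""
    weight: float = 0.5

    def forward(self, x: float, residual: float) -> float:
        # Adapter outputs enter as an additive residual on the backbone feature.
        return self.weight * x + residual


@dataclass
class Adapter:
    """Hypothetical trainable module that maps one condition to a residual."""
    name: str
    scale: float = 0.0  # zero-initialised, so training starts at the backbone's output

    def forward(self, cond: float) -> float:
        return self.scale * cond


def predict(backbone, adapters, x, conds):
    """Run the frozen backbone with the summed adapter residuals."""
    residual = sum(a.forward(conds[a.name]) for a in adapters)
    return backbone.forward(x, residual)


def train_step(backbone, adapters, x, conds, target, lr=0.1):
    """One gradient step on squared error; only adapter scales are updated."""
    error = predict(backbone, adapters, x, conds) - target
    for a in adapters:
        # d(error**2)/d(scale) = 2 * error * cond
        a.scale -= lr * 2.0 * error * conds[a.name]
    return error ** 2


backbone = FrozenBackbone()
adapters = [Adapter("geometry"), Adapter("texture"), Adapter("temporal")]
conds = {"geometry": 1.0, "texture": 0.5, "temporal": -0.5}  # made-up condition values

losses = [
    train_step(backbone, adapters, x=1.0, conds=conds, target=2.0)
    for _ in range(50)
]
final = predict(backbone, adapters, 1.0, conds)
```

After the loop, `backbone.weight` still holds its initial value while the adapter scales have moved to fit the target, which is the sense in which controllability is added at a low training cost in this toy setting.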

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Models > Diffusion Models; Computer Vision > Generation > Image Generation

Keywords

talking face generation; temporal coherence; latent diffusion model; facial geometry; audio disentanglement
