Expressive Talking Head Generation With Granular Audio-Visual Control

Borong Liang; Yan Pan; Zhizhi Guo; Hang Zhou; Zhibin Hong; Xiaoguang Han; Junyu Han; Jingtuo Liu; Errui Ding; Jingdong Wang

CVPR 2022

Abstract

Generating expressive talking heads is essential for creating virtual humans. However, existing one- or few-shot methods focus on lip-sync and head motion, ignoring the emotional expressions that make talking faces realistic. In this paper, we propose the Granularly Controlled Audio-Visual Talking Heads (GC-AVT), which controls the lip movements, head poses, and facial expressions of a talking head in a granular manner. Our insight is to decouple the audio-visual driving sources through prior-based pre-processing designs. Specifically, we disassemble the driving image into three complementary parts: 1) a cropped mouth that facilitates lip-sync; 2) a masked head that implicitly learns pose; and 3) the upper face, which works cooperatively and complementarily with a time-shifted mouth to contribute the expression. Interestingly, the encoded features from the three sources are balanced as a whole through reconstruction training. Extensive experiments show that our method generates expressive faces with not only synced mouth shapes and controllable poses, but also precisely animated emotional expressions.
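As an illustration of the three-part decomposition described in the abstract, the following is a minimal sketch rather than the authors' pre-processing. The helper names (`split_driving_frame`, `time_shifted_mouth`), the rectangular boxes, the choice to blank out the face region for the masked head, and the clamped index shift are all assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np


def split_driving_frame(frame, mouth_box, face_box):
    """Split one driving frame into three views (hypothetical helper).

    frame: H x W x 3 array.
    mouth_box, face_box: (top, bottom, left, right) pixel boxes, assumed to
    come from an external face or landmark detector.
    """
    mt, mb, ml, mr = mouth_box
    ft, fb, fl, fr = face_box

    # 1) Cropped mouth: only the mouth region, intended for lip-sync.
    mouth = frame[mt:mb, ml:mr].copy()

    # 2) Masked head: the full frame with the face region zeroed out
    #    (an assumption about what the mask covers), leaving pose cues.
    masked_head = frame.copy()
    masked_head[ft:fb, fl:fr] = 0

    # 3) Upper face: the part of the face box above the mouth box.
    upper_face = frame[ft:mt, fl:fr].copy()

    return mouth, masked_head, upper_face


def time_shifted_mouth(mouths, t, shift):
    """Return the mouth crop at time t + shift, clamped to the clip range."""
    idx = min(max(t + shift, 0), len(mouths) - 1)
    return mouths[idx]
```

In this sketch, each of the three outputs would be fed to its own encoder, and the upper face would be paired with a mouth crop taken from a different time step via `time_shifted_mouth`.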

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Generation > Image Generation
Computer Vision > Generation > Video Generation
Computer Vision > Processing > Image Editing
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

image generation, audio-visual learning, image reconstruction, talking head generation, lip synchronization, facial expression, emotional expression, head pose control, audio-visual control, facial expression animation
