MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

Kim Sung-Bin; Lee Chae-Yeon; Gihun Son; Oh Hyun-Bin; Janghoon Ju; Suekyeong Nam; Tae-Hyun Oh

INTERSPEECH 2024

Abstract

Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. Using the proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Code and datasets are available at https://multitalk.github.io/.

Topics

Computer Vision > Analysis > Face Recognition
Computer Vision > Generation > Video Generation

Keywords

language-specific embedding, multilingual video, 3D talking head generation, mouth movement, facial movement
