Multi-modal Adversarial Training for Zero-Shot Voice Cloning

John Janiczek; Dading Chong; Dongyang Dai; Arlo Faria; Chao Wang; Tao Wang; Yuzong Liu

INTERSPEECH 2024

Abstract

A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent works that have used Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
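The averaging effect mentioned in the abstract's first sentence can be illustrated with a toy example that is not taken from the paper. The sketch below assumes a made-up set of pitch targets (the `targets` list and its values in Hz are purely hypothetical) in which the same text is spoken in two distinct styles, and checks which single prediction minimizes mean squared error.

```python
# Hypothetical toy data (not from the paper): pitch targets in Hz for the
# same text spoken in two styles, 50 low-pitched and 50 high-pitched samples.
targets = [100.0] * 50 + [200.0] * 50


def mse(prediction, values):
    """Mean squared error of one constant prediction against all values."""
    return sum((prediction - v) ** 2 for v in values) / len(values)


# Try every constant prediction from 80 Hz to 220 Hz in 1 Hz steps.
candidates = [float(hz) for hz in range(80, 221)]
best = min(candidates, key=lambda p: mse(p, targets))

print(best)                 # 150.0: the dataset mean, which matches neither style
print(mse(best, targets))   # 2500.0
print(mse(100.0, targets))  # 5000.0: predicting a real style scores worse
```

Under a pure reconstruction loss the lowest-error output sits between the two styles, even though no speaker in the toy data ever produces it. A discriminator that judges whether output looks like real speech is one way to penalize such in-between predictions, which is the general motivation for adversarial training in this setting.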

Topics

Machine Learning > Learning Types > Adversarial Learning

Machine Learning > Learning Types > Zero-Shot Learning

Speech & Audio > Synthesis > Text-to-Speech

Keywords

zero-shot learning, adversarial training, text-to-speech synthesis, prosodic feature, voice cloning, speaker similarity
