Bilingual and Code-switching TTS Enhanced with Denoising Diffusion Model and GAN

Huai-Zhe Yang; Chia-Ping Chen; Shan-Yun He; Cheng-Ruei Li

INTERSPEECH 2024

Abstract

In this paper, we propose a Mandarin-English bilingual and code-switching text-to-speech (TTS) system that uses a diffusion model and a generative adversarial network (GAN) to improve the output speech. To address speaker consistency, we employ a feature separation architecture that converts language and speaker IDs into embeddings, which serve as input to the encoder. We then use two adversarial classifiers, together with two non-adversarial classifiers, to separate language and speaker features. We integrate a modified diffusion model and discriminators to improve speech quality and speaker consistency, especially in code-switching scenarios. In mean opinion score (MOS) evaluations, the proposed TTS system differs only slightly from the ground-truth data on monolingual speech and achieves a MOS of 3.83 on synthesized code-switching speech.

Topics

Machine Learning > Core Methods > Representation Learning
Speech & Audio > Synthesis > Text-to-Speech

Keywords

generative adversarial network, denoising diffusion model, text-to-speech synthesis, feature separation, code-switching speech, speaker consistency
