Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions

Jingyi Feng; Yusuke Yasuda; Tomoki Toda

INTERSPEECH 2024

Abstract

Large data volumes can benefit text-to-speech (TTS), but speech data with high-quality annotation is limited. Automatic transcription makes it possible to transcribe found speech data and so increase the data volume available for TTS, but TTS training suffers from the resulting transcription errors. In this paper, we investigate the robustness of typical TTS models, including diffusion-based, flow-based, and autoregressive models, against heavily noisy transcripts in terms of objective intelligibility and subjective naturalness. Our experimental results show that diffusion-based TTS is extremely robust to heavily noisy transcriptions, reducing the word error rate by about 30% compared with autoregressive and flow-based models. Based on likelihood analysis, we also show that iterative inference with a long diffusion time is key to the robustness of diffusion-based TTS.
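As an illustration of the word error rate metric mentioned in the abstract, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made here for illustration. The abstract does not describe the authors' evaluation pipeline, so this is not their implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In a TTS intelligibility evaluation, the hypothesis would typically be a transcript of the synthesized speech and the reference would be the input text; that setup is assumed here rather than taken from the abstract.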

Authors

Jingyi Feng, Yusuke Yasuda, Tomoki Toda

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Deep Learning > Models > Diffusion Models

Deep Learning > Models > Generative Models

Keywords

model robustness, autoregressive model, diffusion probabilistic model, text-to-speech synthesis, noisy transcription
