2023 INTERSPEECH INTERSPEECH 2023

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

Abstract

Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference tim- bre or style that is never trained in the target language. It encounters the following challenges: 1) timbre and pronun- ciation are correlated since multilingual speech of a specific speaker is usually hard to obtain; 2) style and pronunciation are mixed because the speech style contains language-agnostic and language-specific parts. To address these challenges, we pro- pose GenerTTS, which mainly includes the following works: 1) we elaborately design a HuBERT-based information bottleneck to disentangle timbre and pronunciation/style; 2) we minimize the mutual information between style and language to discard the language-specific information in the style embedding. The experiments indicate that GenerTTS outperforms baseline sys- tems in terms of style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.

🌉 Interdisciplinary Bridge — Interdisciplinary and Machine Learning
🧭 Keyword Pioneer — cross-lingual timbre
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio