RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech

rohan badlani; Rafael Valle; Kevin J. Shih; Joao Felipe Santos; Siddharth Gururani; Bryan Catanzaro

2023 INTERSPEECH INTERSPEECH 2023

RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech

Abstract

We create a multilingual speech synthesis system that can generate speech with a native accent in any seen language while retaining the characteristics of an individual's voice. It is expensive to obtain bilingual training data for a speaker and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present RADMMM, a speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 languages, with one native speaker per language. Human subjective evaluation demonstrates that, when compared to controlled baselines, our model better retains a speaker's voice and target accent, while synthesizing fluent speech in all target languages and accents in our dataset.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Speech & Audio

🧭 Keyword Pioneer — voice retention

🐣 Hot Topic Early Bird — multilingual speech

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing, Security & Privacy, Speech & Audio