Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Takaaki Saeki; Soumi Maiti; Xinjian Li; Shinji Watanabe; Shinnosuke Takamichi; Hiroshi Saruwatari

IJCAI 2023

Abstract

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages because they need paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. Relying on text-only data allows TTS systems to be developed for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining on multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages that are not included in the paired data but are present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS, with a character error rate of less than 12% for an unseen language.
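The intelligibility figure above is a character error rate (CER). The following is a minimal sketch of how such a score can be computed, assuming the common definition of CER as the character-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. The function names are illustrative, and the abstract does not describe the authors' actual evaluation pipeline, including how transcripts of the synthesized speech are obtained.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four gives a CER of 0.25.
    print(character_error_rate("abcd", "abxd"))
```

Under this definition, a CER below 12% means that fewer than 12 character edits per 100 reference characters are needed to turn the transcript into the reference text.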

Topics

Deep Learning > Models > Generative Models
Deep Learning > Techniques > Pretraining
Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

zero-shot learning; unsupervised pretraining; cross-lingual transfer; multilingual processing; language model
