Towards Universal Text-to-Speech

Jingzhou Yang; Lei He

INTERSPEECH 2020

Abstract

This paper studies a multilingual sequence-to-sequence text-to-speech framework towards universal modeling, one that can synthesize speech for any speaker in any language using a single model. The framework consists of a transformer-based acoustic predictor and a WaveNet neural vocoder, with global conditions from speaker and language networks. It is evaluated on a massive TTS dataset of around 1250 hours from 50 language locales, in which the amount of data per locale is highly unbalanced. Although the multilingual model exhibits transfer learning that benefits the low-resource languages, data imbalance still undermines model performance. A data-balancing training strategy is applied and effectively improves the voice quality of the low-resource languages. Furthermore, the paper examines the model's capacity to extend to new speakers and languages, a key step towards universal modeling. Experiments show that 20 seconds of data is feasible for a new speaker and 6 minutes for a new language.
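The abstract does not say how the data balance training strategy works. As a rough illustration only, the sketch below shows one common way to rebalance a multilingual corpus: sampling locales with probability proportional to their hours raised to an exponent `alpha`. The function names, the `alpha` parameter, and the example locale hours are all hypothetical and are not taken from the paper; the authors' actual strategy may differ.

```python
import random


def balanced_sampling_weights(hours_per_locale, alpha=0.5):
    """Return per-locale sampling probabilities proportional to hours**alpha.

    Assumption (not from the paper): exponent smoothing is used for balancing.
    alpha=1.0 keeps the raw data distribution; alpha=0.0 samples uniformly.
    """
    scaled = {loc: hours ** alpha for loc, hours in hours_per_locale.items()}
    total = sum(scaled.values())
    return {loc: value / total for loc, value in scaled.items()}


def sample_locales(probs, n, seed=0):
    """Draw n locale labels according to the given probabilities."""
    rng = random.Random(seed)
    locales = list(probs)
    weights = [probs[loc] for loc in locales]
    return rng.choices(locales, weights=weights, k=n)


if __name__ == "__main__":
    # Hypothetical corpus sizes in hours, chosen only to show the effect.
    hours = {"high-resource": 100.0, "low-resource": 1.0}
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, balanced_sampling_weights(hours, alpha))
```

With an exponent below 1, the low-resource locale is drawn more often than its raw share of the data, at the cost of repeating its utterances more frequently during training.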

Authors

Jingzhou Yang, Lei He

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Artificial Intelligence > Learning Paradigms > Transfer Learning

Deep Learning > Architectures > Transformers

Natural Language Processing > Generation > Text Generation

Speech & Audio > Synthesis > Text-to-Speech

Keywords

transfer learning, low-resource language, neural vocoder, multilingual speech, waveform generation, universal modeling, multilingual synthesis
