2019 INTERSPEECH INTERSPEECH 2019

Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis

Abstract

In this paper, we propose a novel method to improve the performance and robustness of the front-end text processing modules of Mandarin text-to-speech (TTS) synthesis. We use pre-trained text encoding models, such as the encoder of a transformer based NMT model and BERT, to extract the latent semantic representations of words or characters and use them as input features for tasks in the front-end of TTS systems. Our experiments on the tasks of Mandarin polyphone disambiguation and prosodic structure prediction show that the proposed method can significantly improve the performances. Specifically, we get an absolute improvement of 0.013 and 0.027 in F1 score for prosodic word prediction and prosodic phrase prediction respectively, and an absolute improvement of 2.44% in polyphone disambiguation compared to previous methods.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing and Speech & Audio
🧭 Keyword Pioneer — polyphone disambiguation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Machine Learning, Natural Language Processing, Speech & Audio