Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis

Tomoki Hayashi; Shinji Watanabe; Tomoki Toda; Kazuya Takeda; Shubham Toshniwal; Karen Livescu

INTERSPEECH 2019

Abstract

We propose an end-to-end text-to-speech (TTS) synthesis model that explicitly uses information from pre-trained embeddings of the text. Recent work in natural language processing has developed self-supervised representations of text that have proven very effective as pre-training for language understanding tasks. We propose using one such pre-trained representation (BERT) to encode input phrases, as an additional input to a Tacotron2-based sequence-to-sequence TTS model. We hypothesize that the text embeddings contain information about the semantics of the phrase and the importance of each word, which should help TTS systems produce more natural prosody and pronunciation. We conduct subjective listening tests of our proposed models using the 24-hour LJSpeech corpus, finding that they improve mean opinion scores modestly but significantly over a baseline TTS model without pre-trained text embedding input.
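One way to picture "an additional input" is to append a fixed-size phrase embedding to every encoder time step before the decoder attends over them. The sketch below illustrates only that idea under stated assumptions: the abstract does not specify how the embedding is injected, so the concatenation strategy, the function name `condition_on_phrase_embedding`, and the toy dimensions are all hypothetical, and plain Python lists stand in for the real encoder outputs and the BERT vector.

```python
from typing import List

Vector = List[float]


def condition_on_phrase_embedding(
    encoder_states: List[Vector], phrase_embedding: Vector
) -> List[Vector]:
    """Append the same phrase-level embedding to every encoder time step.

    Assumption: conditioning by concatenation. The abstract only says the
    embedding is an additional input, not how it is combined.
    """
    if not encoder_states:
        raise ValueError("encoder_states must contain at least one time step")
    # List concatenation builds a new vector per step; inputs are not mutated.
    return [state + phrase_embedding for state in encoder_states]


# Toy stand-ins (hypothetical values): 3 encoder time steps of width 2,
# and a width-4 vector in place of a real pre-trained BERT embedding.
encoder_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
phrase_embedding = [1.0, 0.0, -1.0, 0.5]

conditioned = condition_on_phrase_embedding(encoder_states, phrase_embedding)
print(len(conditioned), len(conditioned[0]))  # prints "3 6"
```

The number of time steps is unchanged while each state grows by the width of the phrase embedding, so a downstream attention layer would see the same sequence length with phrase-level context attached to every step.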

Topics

Machine Learning > Core Methods > Embedding Learning
Deep Learning > Architectures > Neural Networks
Natural Language Processing > Resources & Methods > Large Language Models
Speech & Audio > Synthesis > Text-to-Speech

Keywords

prosody prediction, text-to-speech synthesis, sequence-to-sequence model, BERT embedding, pre-trained embedding, pre-trained text embedding, bidirectional encoder representation from transformer
