Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Mingbo Ma; Baigong Zheng; Kaibo Liu; Renjie Zheng; Hairong Liu; Kainan Peng; Kenneth Church; Liang Huang

EMNLP 2020

Abstract

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods now capable of producing highly natural audio. However, these efforts still suffer from two types of latency: (a) computational latency (synthesis time), which grows linearly with sentence length, and (b) input latency in scenarios where the input text becomes available incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce both, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing one segment of audio while generating the next, which results in O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves speech naturalness similar to full-sentence TTS, with only a constant latency of 1-2 words.
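The online scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed `lookahead` parameter is an assumption read off the "1-2 words" latency figure, and `incremental_tts`, `synthesize_word`, and `fake_synthesize` are hypothetical names, with `fake_synthesize` standing in for a neural TTS model.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_tts(
    words: Iterable[str],
    synthesize_word: Callable[[List[str], int], bytes],
    lookahead: int = 1,
) -> Iterator[bytes]:
    """Yield audio for word i once words up to i + lookahead have arrived."""
    prefix: List[str] = []
    emitted = 0
    for word in words:
        prefix.append(word)
        # Emit every word whose lookahead window is now fully available.
        while emitted + lookahead < len(prefix):
            yield synthesize_word(prefix, emitted)
            emitted += 1
    # Input is finished: flush the remaining words with the context we have.
    while emitted < len(prefix):
        yield synthesize_word(prefix, emitted)
        emitted += 1


def fake_synthesize(prefix: List[str], index: int) -> bytes:
    # Hypothetical stand-in for a TTS model: returns the word itself as bytes.
    return prefix[index].encode("utf-8")
```

Because the function is a generator, a caller can play each yielded chunk while the next one is being produced. Under these assumptions, the wait before the first audio depends on `lookahead` and not on sentence length.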

Topics

Artificial Intelligence > Core AI > Foundation Models; Speech & Audio > Synthesis > Text-to-Speech; Deep Learning > Learning Types > Representation Learning

Keywords

latency reduction, text-to-speech synthesis, simultaneous translation, neural text-to-speech, incremental synthesis, prefix-to-prefix framework, online synthesis, simultaneous synthesis
