2017 INTERSPEECH INTERSPEECH 2017

An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis

Abstract

A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours given textual features is proposed. In contrast to related F0 models, the proposed one is designed to learn the temporal correlation of F0 contours at multiple levels. The frame-level correlation is covered by feeding back the F0 output of the previous frame as the additional input of the current frame; meanwhile, the correlation over long-time spans is similarly modeled but by using F0 features aggregated over the phoneme and syllable. Another difference is that the output of the proposed model is not the interpolated continuous-valued F0 contour but rather a sequence of discrete symbols, including quantized F0 levels and a symbol for the unvoiced condition. By using the discrete F0 symbols, the proposed model avoids the influence of artificially interpolated F0 curves. Experiments demonstrated that the proposed F0 model, which was trained using a dropout strategy, generated smooth F0 contours with relatively better perceived quality than those from baseline RNN models.

🧭 Keyword Pioneer — multi-tier feedback
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio