Articulatory Text-to-Speech Synthesis Using the Digital Waveguide Mesh Driven by a Deep Neural Network

Amelia J. Gully; Takenori Yoshimura; Damian T. Murphy; Kei Hashimoto; Yoshihiko Nankaku; Keiichi Tokuda

2017 INTERSPEECH INTERSPEECH 2017

Articulatory Text-to-Speech Synthesis Using the Digital Waveguide Mesh Driven by a Deep Neural Network

Abstract

Following recent advances in direct modeling of the speech waveform using a deep neural network, we propose a novel method that directly estimates a physical model of the vocal tract from the speech waveform, rather than magnetic resonance imaging data. This provides a clear relationship between the model and the size and shape of the vocal tract, offering considerable flexibility in terms of speech characteristics such as age and gender. Initial tests indicate that despite a highly simplified physical model, intelligible synthesized speech is obtained. This illustrates the potential of the combined technique for the control of physical models in general, and hence the generation of more natural-sounding synthetic speech.

🧭 Keyword Pioneer — speech waveform

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Amelia J. Gully , Takenori Yoshimura , Damian T. Murphy , Kei Hashimoto , Yoshihiko Nankaku , Keiichi Tokuda

Topics

Speech & Audio > Synthesis > Text-to-Speech

Keywords

deep neural network articulatory synthesis speech waveform vocal tract physical model

Download PDF

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017