Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis

Felipe Espic; Cassia Valentini-Botinhao; Simon King

INTERSPEECH 2017

Abstract

We propose a simple new representation for the FFT spectrum tailored to statistical parametric speech synthesis. It consists of four feature streams that describe magnitude, phase and fundamental frequency using real numbers. The proposed feature extraction method does not attempt to decompose the speech structure (e.g., into source+filter or harmonics+noise). By avoiding the simplifications inherent in decomposition, we can dramatically reduce the “phasiness” and “buzziness” typical of most vocoders. The method uses simple and computationally cheap operations and can operate at a lower frame rate than the 200 frames-per-second typical in many systems. It avoids heuristics and methods requiring approximate or iterative solutions, including phase unwrapping. Two DNN-based acoustic models were built — from male and female speech data — using the Merlin toolkit. Subjective comparisons were made with a state-of-the-art baseline, using the STRAIGHT vocoder. In all variants tested, and for both male and female voices, the proposed method substantially outperformed the baseline. We provide source code to enable our complete system to be replicated.
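The abstract does not spell out how the feature streams are computed, so the following is only a minimal sketch of the general idea, not the authors' extraction method. It assumes one plausible way to describe an FFT frame with real numbers and no phase unwrapping: the log magnitude of each bin, plus the cosine and sine of its phase (the real and imaginary parts divided by the magnitude). The helper names `dft` and `magnitude_phase_features`, the magnitude floor, and the toy frame are all hypothetical, and the fundamental-frequency stream is left out.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT; returns bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def magnitude_phase_features(frame, floor=1e-12):
    """Hypothetical helper: real-valued magnitude and phase features per bin.

    Assumption (not taken from the abstract): phase is encoded as the
    normalised real and imaginary parts, i.e. cos(phase) and sin(phase),
    so no unwrapping step is needed.
    """
    spectrum = dft(frame)
    # Floor the magnitude to avoid log(0) and division by zero.
    mags = [max(abs(c), floor) for c in spectrum]
    log_mag = [math.log(m) for m in mags]
    phase_cos = [c.real / m for c, m in zip(spectrum, mags)]
    phase_sin = [c.imag / m for c, m in zip(spectrum, mags)]
    return log_mag, phase_cos, phase_sin


# Toy frame: a cosine with 8 cycles across 64 samples and a phase offset of 0.5 rad.
N = 64
frame = [math.cos(2 * math.pi * 8 * t / N + 0.5) for t in range(N)]
log_mag, phase_cos, phase_sin = magnitude_phase_features(frame)

# The phase of bin 8 can be recovered directly from the two bounded features.
recovered_phase = math.atan2(phase_sin[8], phase_cos[8])
```

Because the cosine and sine features are bounded and continuous in the underlying phase, they can be treated as ordinary real-valued regression targets, which is the kind of property the abstract alludes to when it says phase unwrapping is avoided.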

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Neural Networks
Speech & Audio > Synthesis > Text-to-Speech
Deep Learning > Learning Types > Deep Learning

Keywords

speech synthesis, acoustic model, deep neural network, phase spectrum, statistical parametric speech synthesis, magnitude spectrum, magnitude spectra, phase spectra, FFT spectrum

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017