Text-to-speech synthesis using spectral modeling based on non-negative autoencoder

Takeru Gorai; Daisuke Saito; Nobuaki Minematsu

2022 INTERSPEECH INTERSPEECH 2022

Text-to-speech synthesis using spectral modeling based on non-negative autoencoder

Abstract

This paper proposes a statistical parametric speech synthesis system that uses non-negative autoencoder (NAE) for spectral modeling. NAE is a model that extends non-negative matrix factorization (NMF) as neural networks. In the proposed method, we employ latent variables in NAE as acoustic features. Reconstruction of spectral information and estimation of latent variables are simultaneously trained. The non-negativity of latent variables in NAE is expected to contribute to dimensionality reduction such that the fine structure of the spectral envelopes is preserved. Experimental results demonstrates the effectiveness of the proposed framework. We also study multispeaker modeling where each of NAEs corresponds to each single speaker. In addition, a neural source-filter (NSF) model was applied to the waveform generation. When a neural vocoder is trained with natural acoustic features and tested with synthesized features, quality degradation occurs due to the mismatch between training and test data. In order to mitigate the mismatch, this system uses features obtained by reconstructing natural speech using NAE for training. Experimental results show that reconstructed features are similar to synthesized features, and as a result, the quality of the synthesized speech is improved.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — non-negative autoencoder

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Takeru Gorai , Daisuke Saito , Nobuaki Minematsu

Topics

Machine Learning > Core Methods > Representation Learning Deep Learning > Architectures > Autoencoders

Keywords

latent variable text-to-speech synthesis spectral modeling non-negative autoencoder neural source-filter model

Download PDF

Related papers

Example-based Explanations with Adversarial Attacks for Respiratory Sound Analysis 2022

Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset 2022

Evidence of Onset and Sustained Neural Responses to Isolated Phonemes from Intracranial Recordings in a Voice-based Cursor Control Task 2022

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications 2022

Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction 2022