2022 INTERSPEECH INTERSPEECH 2022

Text-to-speech synthesis using spectral modeling based on non-negative autoencoder

Abstract

This paper proposes a statistical parametric speech synthesis system that uses non-negative autoencoder (NAE) for spectral modeling. NAE is a model that extends non-negative matrix factorization (NMF) as neural networks. In the proposed method, we employ latent variables in NAE as acoustic features. Reconstruction of spectral information and estimation of latent variables are simultaneously trained. The non-negativity of latent variables in NAE is expected to contribute to dimensionality reduction such that the fine structure of the spectral envelopes is preserved. Experimental results demonstrates the effectiveness of the proposed framework. We also study multispeaker modeling where each of NAEs corresponds to each single speaker. In addition, a neural source-filter (NSF) model was applied to the waveform generation. When a neural vocoder is trained with natural acoustic features and tested with synthesized features, quality degradation occurs due to the mismatch between training and test data. In order to mitigate the mismatch, this system uses features obtained by reconstructing natural speech using NAE for training. Experimental results show that reconstructed features are similar to synthesized features, and as a result, the quality of the synthesized speech is improved.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🧭 Keyword Pioneer — non-negative autoencoder
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio