A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks

Takenori Yoshimura; Gustav Eje Henter; Oliver Watts; Mirjam Wester; Junichi Yamagishi; Keiichi Tokuda

INTERSPEECH 2016

Abstract

A problem when developing and tuning speech synthesis systems is that there is no well-established method of automatically rating the quality of the synthetic speech. This research attempts to obtain a new automated measure which is trained on the result of large-scale subjective evaluations employing many human listeners, i.e., the Blizzard Challenge. To exploit the data, we experiment with linear regression, feed-forward and convolutional neural network models, and combinations of them to regress from synthetic speech to the perceptual scores obtained from listeners. The biggest improvements were seen when combining stimulus- and system-level predictions.
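The idea of combining stimulus- and system-level predictions can be illustrated with a minimal two-stage sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy data, the single scalar feature per stimulus, the use of plain least-squares lines at both stages, and the helper names `fit_line` and `system_means`. The first stage predicts a score for each stimulus; the second stage averages those predictions per system and regresses the averages onto system-level mean scores.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def system_means(system_ids, values):
    """Average per-stimulus values within each system."""
    groups = defaultdict(list)
    for sid, v in zip(system_ids, values):
        groups[sid].append(v)
    return {sid: mean(vs) for sid, vs in groups.items()}


# Toy data (invented for illustration): one feature per stimulus,
# the system that produced it, and a listener score for that stimulus.
system_ids = ["A", "A", "B", "B", "C", "C"]
features = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
scores = [1.5, 2.0, 2.5, 3.5, 4.0, 4.5]

# Stage 1 (hypothetical): stimulus-level predictor.
a1, b1 = fit_line(features, scores)
stimulus_preds = [a1 * x + b1 for x in features]

# Stage 2 (hypothetical): system-level predictor fed with the
# per-system averages of the stage-1 outputs.
pooled = system_means(system_ids, stimulus_preds)
targets = system_means(system_ids, scores)
order = sorted(pooled)
a2, b2 = fit_line([pooled[s] for s in order], [targets[s] for s in order])
system_preds = {s: a2 * pooled[s] + b2 for s in order}
```

In this sketch the second stage only rescales the pooled first-stage outputs; the abstract's neural network variants would replace one or both of the least-squares fits, with details that the abstract does not specify.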
