2017 INTERSPEECH INTERSPEECH 2017

Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Abstract

The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottle-neck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, or (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach. All the d-vector variants perform similar for speech quality; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similar, except a small significant preference for the d-vector calculated as an average over the i-vector.

๐ŸŒ‰ Interdisciplinary Bridge โ€” Artificial Intelligence and Deep Learning and Machine Learning
๐Ÿ Cross-Pollinator โ€” Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio
๐Ÿฃ Hot Topic Early Bird โ€” speaker embedding