Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors
Abstract
The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottle-neck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, or (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach. All the d-vector variants perform similar for speech quality; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similar, except a small significant preference for the d-vector calculated as an average over the i-vector.