Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Rama Doddipatla; Norbert Braunschweiler; Ranniery Maia

2017 INTERSPEECH INTERSPEECH 2017

Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Abstract

The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottle-neck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, or (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach. All the d-vector variants perform similar for speech quality; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similar, except a small significant preference for the d-vector calculated as an average over the i-vector.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

🐣 Hot Topic Early Bird — speaker embedding

Authors

Rama Doddipatla , Norbert Braunschweiler , Ranniery Maia

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Machine Learning > Application Areas > Domain Adaptation Deep Learning > Architectures > Neural Networks Speech & Audio > Synthesis > Text-to-Speech Deep Learning > Learning Types > Transfer Learning

Keywords

principal component analysis speaker embedding deep neural network text-to-speech synthesis speaker identity speaker adaptation

Download PDF

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017