One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS

Jaeuk Lee; Joon-Hyuk Chang

INTERSPEECH 2022

Abstract

Speaker adaptation for personalizing text-to-speech (TTS) has become increasingly important. We propose a novel adaptation method that uses only a few seconds of speech from an unseen speaker. We first train a multi-speaker TTS model with a speaker embedding lookup table, in which each speaker embedding represents a speaker's timbre. We then propose an initial embedding predictor that extracts an initial embedding suitable for adapting to unseen speakers. The predictor is trained on the trained speaker embeddings, and adversarial training is applied to further improve its performance. After adversarial training, the initial embedding predictor infers the unseen speaker's initial embedding, which is then fine-tuned. Because the initial embedding already contains timbre information about the unseen speaker, adaptation is faster and requires less data than conventional methods. We evaluate performance using the mean opinion score (MOS) and demonstrate that adaptation is feasible with only 5 s of data.
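The benefit of a good starting point can be illustrated with a deliberately small toy, which is not the authors' model. It assumes that only the speaker embedding is updated during fine-tuning, replaces the frozen TTS model with an identity mapping, and fakes the initial embedding predictor with a fixed, slightly inaccurate estimate. All names and numbers below are hypothetical.

```python
# Toy sketch only: none of these names or values come from the paper.

def loss_and_grad(embedding, observed):
    """Squared error between 'synthesized' and observed features.

    The frozen TTS model is replaced by an identity mapping (an assumption),
    so the synthesized features are just the embedding itself.
    """
    diff = [e - o for e, o in zip(embedding, observed)]
    loss = 0.5 * sum(d * d for d in diff)
    return loss, diff  # diff is the gradient of the loss w.r.t. the embedding


def fine_tune(init, observed, lr=0.1, tol=1e-4, max_steps=1000):
    """Gradient descent on the embedding only; returns (embedding, steps used)."""
    emb = list(init)
    for step in range(max_steps):
        loss, grad = loss_and_grad(emb, observed)
        if loss < tol:
            return emb, step
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb, max_steps


# Hypothetical features of the unseen speaker (stand-in for a short utterance).
observed = [0.8, -0.3, 0.5, 0.1]

# Baseline: start from a generic embedding (here, all zeros).
generic_init = [0.0, 0.0, 0.0, 0.0]

# Stand-in for the initial embedding predictor: a fixed estimate that is
# already close to the speaker, but not exact.
predicted_init = [0.7, -0.2, 0.45, 0.15]

emb_generic, steps_generic = fine_tune(generic_init, observed)
emb_predicted, steps_predicted = fine_tune(predicted_init, observed)

print(f"generic init:   {steps_generic} steps")
print(f"predicted init: {steps_predicted} steps")
```

In this toy, both runs reach the same tolerance, but the run that starts from the predicted embedding needs fewer update steps. This mirrors, in a very simplified form, the abstract's claim that a timbre-aware initialization speeds up adaptation.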

Topics

Machine Learning > Learning Types > Adversarial Learning
Speech & Audio > Synthesis > Text-to-Speech

Keywords

speaker embedding, generative adversarial network, speaker adaptation, voice personalization
