An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets

Pilar Oplustil Gallegos; Jennifer Williams; Joanna Rownicka; Simon King

INTERSPEECH 2020

Abstract

Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
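The grouping step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which acoustic representation or clustering algorithm is used, so everything here is an assumption made for illustration: each speaker is represented by the mean of hypothetical utterance-level vectors, and speakers are grouped with plain k-means. The function names (`speaker_means`, `kmeans`, `cluster_speakers`) are hypothetical, and deciding which cluster to train on is left out because the abstract gives no criterion.

```python
import random
from statistics import fmean


def speaker_means(utterance_vectors):
    """Average each speaker's utterance-level vectors into one vector (assumed pooling)."""
    return {
        spk: [fmean(dim) for dim in zip(*vecs)]
        for spk, vecs in utterance_vectors.items()
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (an assumed choice of algorithm); returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([fmean(dim) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def cluster_speakers(utterance_vectors, k, seed=0):
    """Group speaker IDs by clustering their per-speaker mean vectors."""
    means = speaker_means(utterance_vectors)
    speakers = sorted(means)
    labels = kmeans([means[s] for s in speakers], k, seed=seed)
    clusters = {}
    for spk, lab in zip(speakers, labels):
        clusters.setdefault(lab, []).append(spk)
    return list(clusters.values())


# Toy data: two utterance vectors per speaker, two well-separated groups.
toy = {
    "a1": [[0.0, 0.0], [0.2, 0.0]],
    "a2": [[0.1, 0.1], [0.1, -0.1]],
    "b1": [[10.0, 10.0], [10.2, 10.0]],
    "b2": [[10.1, 10.1], [10.1, 9.9]],
}
groups = cluster_speakers(toy, k=2)
```

With real data, the toy vectors would be replaced by whatever per-utterance acoustic representation is available, and a candidate training subset would be the speakers of one chosen cluster.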

Authors

Pilar Oplustil Gallegos, Jennifer Williams, Joanna Rownicka, Simon King

Topics

Machine Learning > Core Methods > Clustering

Machine Learning > Learning Types > Unsupervised Learning

Keywords

acoustic representation, speaker clustering, unsupervised methodology, speaker synthesis

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020