Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

Erica Cooper; Xinyue Wang; Alison Chang; Yocheved Levitan; Julia Hirschberg

INTERSPEECH 2017

Abstract

This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances for training voices from a noisy, multi-speaker corpus. The filters aim to exclude speech containing noise and to include speech close in nature to more traditionally collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in low-resource languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
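As an illustration of what feature-based utterance selection can look like, the following is a minimal Python sketch and not the authors' implementation. The `Utterance` fields, the helper names, the syllable-based rate measure, the kept fraction, and the direction of each filter (keeping the low or the high end of a feature) are all assumptions made for this example; the abstract does not specify them. A hypo-articulation measure, which the abstract names but does not define, could be supplied as one more feature function.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Callable, List


@dataclass
class Utterance:
    """Hypothetical per-utterance record; field names are illustrative only."""
    utt_id: str
    f0_hz: List[float]   # f0 values for voiced frames, in Hz
    n_syllables: int     # syllable count from a transcript or alignment
    duration_s: float    # utterance duration in seconds, assumed > 0


def f0_stdev(u: Utterance) -> float:
    """Population standard deviation of f0 over voiced frames."""
    return pstdev(u.f0_hz) if u.f0_hz else 0.0


def speaking_rate(u: Utterance) -> float:
    """Syllables per second (one possible rate measure, assumed here)."""
    return u.n_syllables / u.duration_s


def select(
    utts: List[Utterance],
    feature: Callable[[Utterance], float],
    fraction: float,
    keep: str = "low",
) -> List[Utterance]:
    """Rank utterances by a feature and keep one end of the ranking.

    keep="low" keeps the utterances with the smallest feature values,
    keep="high" the largest. Both the direction and the fraction are
    free parameters in this sketch.
    """
    ranked = sorted(utts, key=feature, reverse=(keep == "high"))
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]
```

A call such as `select(corpus, f0_stdev, 0.5, keep="low")` would return the half of a hypothetical `corpus` list with the least f0 variation; swapping in `speaking_rate` or changing `keep` yields the other subsets one might want to compare.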

Topics

Speech & Audio > Recognition > Automatic Speech Recognition; Speech & Audio > Synthesis > Text-to-Speech

Keywords

automatic speech recognition; speech intelligibility; text-to-speech synthesis; utterance selection; low resource language; HMM-based voice
