Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection

Zhehuai Chen; Andrew Rosenberg; Yu Zhang; Gary Wang; Bhuvana Ramabhadran; Pedro J. Moreno

INTERSPEECH 2020

Abstract

Data augmentation based on text-to-speech (TTS) synthesis is a relatively new mechanism for using text-only data to improve automatic speech recognition (ASR) training without changes to the model parameters or the inference architecture. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is larger than the transcribed speech corpus by several orders of magnitude, which makes synthesizing all of the text data impractical. In this work, we propose combining a generative adversarial network (GAN) with multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate that the proposed method allows ASR models to learn from the synthesis of large-scale unspoken text sources and achieves a 35% relative word error rate (WER) reduction on a voice-search task.
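The abstract does not spell out how the contrastive selection is scored, so the following is only a minimal sketch of one common form of contrastive language-model selection: each candidate sentence is ranked by the difference between its per-word log-probability under an in-domain model and under a background model, and only the top-ranked sentences are passed on to synthesis. The add-one smoothed unigram models, the function names (train_unigram, avg_logprob, contrastive_select), and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word occurrences over whitespace-tokenized, lowercased sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def avg_logprob(words, counts, vocab_size):
    """Per-word log-probability under an add-one smoothed unigram model."""
    total = sum(counts.values())
    logp = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)
    return logp / len(words)


def contrastive_select(candidates, in_domain, background, keep):
    """Return the `keep` candidates that look most in-domain.

    Score = avg log P_in_domain(sentence) - avg log P_background(sentence).
    This scoring rule is an assumption, not the paper's stated method.
    """
    target_lm = train_unigram(in_domain)
    background_lm = train_unigram(background)

    # Shared vocabulary so both smoothed models cover every candidate word.
    vocab = set(target_lm) | set(background_lm)
    for sentence in candidates:
        vocab.update(sentence.lower().split())
    vocab_size = len(vocab)

    scored = []
    for sentence in candidates:
        words = sentence.lower().split()
        if not words:
            continue  # skip empty lines; they carry no signal
        score = (avg_logprob(words, target_lm, vocab_size)
                 - avg_logprob(words, background_lm, vocab_size))
        scored.append((score, sentence))

    # Highest contrastive score first; sorted() is stable for ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:keep]]


if __name__ == "__main__":
    in_domain = ["play some music", "call mom", "navigate to the airport"]
    background = ["the committee approved the annual budget",
                  "stocks fell sharply on tuesday"]
    candidates = ["play music",
                  "the committee approved the budget",
                  "call the airport"]
    for sentence in contrastive_select(candidates, in_domain, background, keep=2):
        print(sentence)
```

In a realistic setting the two unigram models would be replaced by stronger language models, and the value of keep would be set by the synthesis budget rather than fixed by hand.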

Topics

Machine Learning > Learning Types > Contrastive Learning; Machine Learning > Application Areas > Data Augmentation; Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

contrastive learning; data augmentation; automatic speech recognition; generative adversarial network; text-to-speech synthesis; acoustic diversity
