INTERSPEECH 2020

Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator

Abstract

Rapid unsupervised speaker adaptation in an end-to-end (E2E) system poses new challenges due to its unified structure, in addition to the intrinsic difficulties of data sparsity and imperfect labels [1]. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance improvement in a hybrid system [2]. In this paper, we answer the following two questions: first, how to effectively perform rapid speaker adaptation in a recurrent neural network transducer (RNN-T); second, whether our previously proposed approach is still beneficial for the RNN-T, and what modifications and distinct observations arise. We apply the proposed methodology to a speaker adaptation task in a state-of-the-art presentation transcription RNN-T system. In the 1 min setup, it yields 11.58% and 7.95% relative word error rate (WER) reduction for supervised and unsupervised adaptation, respectively, compared with the negligible gain when adapting with 1 min of source speech. In the 10 min setup, it yields 15.71% and 8.00% relative WER reduction, respectively, doubling the gain of source speech adaptation. We further apply various data filtering techniques and significantly narrow the gap between supervised and unsupervised adaptation.
