The JHU ASR System for VOiCES from a Distance Challenge 2019

Yiming Wang; David Snyder; Hainan Xu; Vimal Manohar; Phani Sankar Nidadavolu; Daniel Povey; Sanjeev Khudanpur

INTERSPEECH 2019

Abstract

This paper describes the system developed by the JHU team for the automatic speech recognition (ASR) task of the VOiCES from a Distance Challenge 2019, which focuses on single-channel distant (far-field) audio under noisy conditions. We participated in the Fixed Condition track, in which systems are trained only on an 80-hour subset of the Librispeech corpus provided by the organizers. The training data was first augmented with both background noise and simulated reverberation. We then trained two factorized TDNN acoustic models that differed only in their use of i-vectors for adaptation. Both systems used RNN language models trained on original and reversed text for rescoring. We submitted three systems: the system using i-vectors, with a WER of 19.4% on the development set; the system without i-vectors, with a WER of 19.0%; and their lattice-level fusion, with a WER of 17.8%. On the evaluation set, our best system achieved a WER of 23.9%.
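Since all results above are reported as word error rate (WER), a minimal sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the challenge's official scoring tool, which may also normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (an assumption, not the challenge's scoring rules).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.1%}")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very noisy hypotheses.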

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

automatic speech recognition, word error rate, speech augmentation, RNN language model, factorized TDNN
