ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

Jing Pan; Joshua Shapiro; Jeremy Wohlwend; Kyu J. Han; Tao Lei; Tao Ma

INTERSPEECH 2020

Abstract

In this paper, we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures: a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes input speech frames in multiple parallel pipelines, where each stream has a unique dilation rate for diversity. Trained with the SpecAugment data augmentation method, it achieves relative word error rate (WER) improvements of 4% on test-clean and 14% on test-other. We further improve performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
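
The multistream idea in the abstract (parallel pipelines over the same frames, each with its own dilation rate) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the scalar per-frame features, the fixed kernel `[1, 1, 1]`, and the dilation rates `(1, 2, 3)` are all assumptions chosen for readability, and a real acoustic model would use learned, multi-channel layers.

```python
from typing import List, Sequence


def dilated_conv1d(frames: Sequence[float],
                   kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Convolve over time with a given dilation, zero-padding the edges
    so the output has one value per input frame."""
    center = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for i, w in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around frame t.
            idx = t + (i - center) * dilation
            if 0 <= idx < len(frames):
                acc += w * frames[idx]
        out.append(acc)
    return out


def multistream_features(frames: Sequence[float],
                         kernel: Sequence[float],
                         dilations: Sequence[int]) -> List[List[float]]:
    """Run one stream per dilation rate on the same input, then
    concatenate the stream outputs frame by frame."""
    streams = [dilated_conv1d(frames, kernel, d) for d in dilations]
    return [list(per_frame) for per_frame in zip(*streams)]


if __name__ == "__main__":
    # Hypothetical scalar "frames"; real inputs would be feature vectors.
    frames = [1, 2, 3, 4, 5, 6, 7, 8]
    features = multistream_features(frames, [1, 1, 1], (1, 2, 3))
    for t, feat in enumerate(features):
        print(t, feat)
```

Each stream sees the same frames but with a different temporal span: with a 3-tap kernel, dilation 1 covers 3 consecutive frames, while dilation 3 covers a 7-frame window. Concatenating the streams gives each frame a feature that mixes several temporal resolutions.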

Topics

Machine Learning > Application Areas > Data Augmentation
Deep Learning > Architectures > Neural Networks
Deep Learning > Techniques > Pretraining
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

data augmentation, language modeling, acoustic modeling, multistream convolutional neural network
