Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

Awni Hannun; Ann Lee; Qiantong Xu; Ronan Collobert

INTERSPEECH 2019

Abstract

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves WER by more than 22% relative over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
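To give a rough sense of why separating time mixing from depth (feature) mixing can cut the parameter count without shrinking the temporal kernel, the following minimal Python sketch compares weight counts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the split into one time convolution plus a single pointwise layer, and the sizes (`kernel=21`, `width=80`, `channels=10`) are all hypothetical choices made for this example.

```python
def conv1d_params(kernel, channels_in, channels_out):
    """Weights in a standard 1D convolution over time (biases ignored)."""
    return kernel * channels_in * channels_out


def tds_block_params(kernel, width, channels):
    """Weights in a hypothetical time-depth separable block (biases ignored).

    Assumption: the input features are viewed as `width` positions with
    `channels` channels each, so the flat feature size is width * channels.
    """
    # Time mixing: a (kernel x 1) convolution applied along time. Its weights
    # are shared across the `width` positions and mix only the channels.
    time_conv = kernel * channels * channels
    # Depth mixing: one pointwise (kernel size 1) layer over all
    # width * channels features, with no mixing across time.
    depth_mix = (width * channels) ** 2
    return time_conv + depth_mix


# Hypothetical sizes chosen only for illustration.
kernel, width, channels = 21, 80, 10
features = width * channels

standard = conv1d_params(kernel, features, features)
separable = tds_block_params(kernel, width, channels)

print(f"standard 1D conv:     {standard:,} weights")
print(f"time-depth separable: {separable:,} weights")
print(f"reduction factor:     {standard / separable:.1f}x")
```

In this sketch both layers span the same `kernel` frames in time, so the per-layer temporal receptive field is unchanged; only the cost of mixing across time is decoupled from the cost of mixing across features.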

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition

Keywords

speech recognition, automatic speech recognition, convolutional neural network, encoder-decoder architecture, sequence-to-sequence model, time-depth separable convolution, convolutional language model, beam search inference, sequence-to-sequence speech recognition
