Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition

Dhananjaya Gowda; Abhinav Garg; Kwangyoun Kim; Mehul Kumar; Chanwoo Kim

INTERSPEECH 2019

Abstract

In this paper, we present a new hierarchical character to byte-pair encoding (C2B) end-to-end neural network architecture for improving the performance of attention-based encoder-decoder ASR models. We explore different strategies for building the hierarchical C2B models, such as building the individual blocks one at a time and training the entire model as a monolith in a single step. We show that a C2B model trained simultaneously with four losses, two for character sequences and two for BPE sequences, helps regularize the learning of both the character and the BPE sequences. The proposed multi-task multi-resolution hierarchical architecture improves the WER of a small-footprint bidirectional full-attention E2E model on the 960-hour LibriSpeech corpus by around 15% relative, and is comparable to the state-of-the-art performance of a model almost three times larger on the same dataset.
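The four-loss training objective and the "15% relative" figure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the individual losses or say how they are weighted, so the field names, the equal default weights, and the weighted-sum combination are all assumptions made for illustration. The relative WER helper uses the standard definition, (baseline - new) / baseline.

```python
from dataclasses import dataclass


@dataclass
class C2BLosses:
    """Four per-batch loss values: two character-level, two BPE-level.

    Field names are placeholders; the abstract does not name the losses.
    """
    char_loss_a: float
    char_loss_b: float
    bpe_loss_a: float
    bpe_loss_b: float


def combine_losses(losses, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four losses (equal weights are an assumption)."""
    values = (losses.char_loss_a, losses.char_loss_b,
              losses.bpe_loss_a, losses.bpe_loss_b)
    if len(weights) != len(values):
        raise ValueError("expected one weight per loss")
    return sum(w * v for w, v in zip(weights, values))


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

For example, with made-up loss values, `combine_losses(C2BLosses(1.0, 2.0, 3.0, 4.0))` averages the four terms, and `relative_wer_reduction(10.0, 8.5)` shows how a hypothetical drop from 10.0% to 8.5% WER corresponds to a 15% relative improvement. These numbers are illustrative only and are not results from the paper.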

Topics

Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Speech Recognition
Machine Learning > Learning Types > Multi-Task Learning

Keywords

multi-task learning; attention mechanism; encoder-decoder architecture; byte-pair encoding; character-level modeling; end-to-end speech recognition
