INTERSPEECH 2019

Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition

Abstract

In this paper, we present a new hierarchical character to byte-pair encoding (C2B) end-to-end neural network architecture for improving the performance of attention-based encoder-decoder ASR models. We explore different strategies for building the hierarchical C2B models, including building the individual blocks one at a time and training the entire model as a monolith in a single step. We show that training a C2B model simultaneously with four losses, two for character sequences and two for BPE sequences, helps regularize the learning of both the character and the BPE sequences. The proposed multi-task multi-resolution hierarchical architecture improves the WER of a small-footprint bidirectional full-attention E2E model on the 960-hour LibriSpeech corpus by around 15% relative, and is comparable to the state-of-the-art performance of a model almost three times larger on the same dataset.
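
The abstract does not say how the four losses are combined during joint training. As a minimal sketch, assuming a simple weighted sum (a common choice in multi-task learning, not necessarily the authors' method), the combination could look like the following. The function name `combine_losses`, the loss names, the weights, and the numeric values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def combine_losses(
    losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine several task losses into one scalar by a weighted sum.

    Assumption: the abstract only says four losses are trained
    simultaneously; the weighted sum used here is illustrative.
    """
    if not losses:
        raise ValueError("at least one loss is required")
    if weights is None:
        # Default to equal weights that sum to 1.
        weights = {name: 1.0 / len(losses) for name in losses}
    if set(weights) != set(losses):
        raise ValueError("weights and losses must have the same keys")
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical per-batch loss values: two character-level, two BPE-level.
example_losses = {
    "char_loss_a": 2.0,
    "char_loss_b": 1.0,
    "bpe_loss_a": 3.0,
    "bpe_loss_b": 2.0,
}

# Equal weighting of all four losses.
equal_total = combine_losses(example_losses)

# Hypothetical unequal weighting that emphasizes one loss per resolution.
custom_total = combine_losses(
    example_losses,
    {"char_loss_a": 0.4, "char_loss_b": 0.1, "bpe_loss_a": 0.4, "bpe_loss_b": 0.1},
)
```

In a real training loop the values would be differentiable tensors rather than floats, and the single combined scalar would be back-propagated so that the character-level and BPE-level objectives shape the shared parameters together.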
