2019 INTERSPEECH INTERSPEECH 2019

Multi-Stride Self-Attention for Speech Recognition

Abstract

In contrast to the huge success of self-attention based neural networks in various NLP tasks, the efficacy of self-attention in speech applications is yet limited. This is partly because the full effectiveness of the self-attention mechanism could not be achieved without proper down-sampling schemes in speech tasks. To address this issue, we propose a new self-attention mechanism suitable for speech recognition, namely, multi-stride self-attention. The proposed multi-stride approach lets each group of heads in self-attention process speech frames with a unique stride over neighboring frames. Thus, the entire attention mechanism would not be confined in a fixed frame shift and can have diverse contextual views for a given frame to determine attention weights more effectively. To validate our proposal we evaluated it on various speech corpora for speech recognition, both English and Chinese, and observed a consistent improvement, especially in terms of substitution and deletion errors, without the increase of model complexity. The average WER improvement of 7.5% (relative) obtained by the TDNNs having the multi-stride self-attention layer as compared to the baseline TDNN model shows the effectiveness of the proposed multi-stride self-attention mechanism.

🌉 Interdisciplinary Bridge — Deep Learning and Speech & Audio
🧭 Keyword Pioneer — multi-stride attention
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio