An Analysis of “Attention” in Sequence-to-Sequence Models

Rohit Prabhavalkar; Tara N. Sainath; Bo Li; Kanishka Rao; Navdeep Jaitly

INTERSPEECH 2017

Abstract

In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including “online” and “full-sequence” attention. Second, we explore different subword units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that attention is typically focused over a small region of the acoustics during each step of next-label prediction. Nevertheless, “full-sequence” attention outperforms “online” attention, though this gap can be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that context-independent phonemes are a reasonable subword unit for attention models. When used in the second pass to rescore N-best hypotheses, these models provide over a 10% relative improvement in word error rate.
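The contrast between attending over the whole utterance and attending over a limited segment can be illustrated with a small sketch. This is not the paper's model: the dot-product scoring, the `attend` helper, and the toy frame values are all assumptions made for illustration. It only shows how restricting the softmax to a segment changes which acoustic frames can receive weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, frames, start=0, end=None):
    """Dot-product attention restricted to frames[start:end].

    Hypothetical helper: returns (weights, context), where weights has one
    entry per frame of the full sequence and is zero outside the span.
    """
    end = len(frames) if end is None else end
    span = frames[start:end]
    local = softmax([dot(query, f) for f in span])
    weights = [0.0] * len(frames)
    weights[start:end] = local
    context = [sum(w * f[d] for w, f in zip(local, span))
               for d in range(len(query))]
    return weights, context

# Toy encoder outputs: 6 frames, 2 dimensions (made-up numbers).
frames = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5],
          [1.8, 1.7], [0.0, 0.3], [0.1, 0.1]]
query = [1.0, 1.0]  # stand-in for a decoder state

# "Full-sequence" style: every frame competes for attention weight.
full_w, full_c = attend(query, frames)

# Segment-restricted style: only the first 3 frames are visible.
seg_w, seg_c = attend(query, frames, start=0, end=3)
```

In this toy setup, widening the segment (for example `end=4`) lets frame 3 receive weight again, which mirrors the abstract's observation that longer segments narrow the gap to full-sequence attention.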

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Speech Recognition

Keywords

automatic speech recognition, attention model, subword unit
