Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory

Chunyang Wu; Yongqiang Wang; Yangyang Shi; Ching-Feng Yeh; Frank Zhang

INTERSPEECH 2020


Abstract

Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full sequence, and its computational cost grows quadratically with the input sequence length. These factors limit its adoption for streaming applications. In this work, we propose a novel augmented memory self-attention, which attends to a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all processed segments. On the LibriSpeech benchmark, our proposed method outperforms all existing streamable transformer methods by a large margin and achieves over 15% relative error reduction compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on some large internal datasets.
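The idea of attending to a short segment plus a bank of memories can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not say how memory slots are computed, so this sketch assumes one slot per processed segment, taken as the mean of that segment's input frames. The function name `augmented_memory_attention`, the single attention head, and the projection matrices `w_q`, `w_k`, `w_v` are likewise hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def augmented_memory_attention(frames, segment_len, w_q, w_k, w_v):
    """Single-head attention over [memory bank; current segment].

    frames: (T, d) input sequence, processed segment by segment.
    Returns the (T, d_v) outputs and the final (num_segments, d) memory bank.
    """
    d_k = w_k.shape[1]
    memory = np.zeros((0, frames.shape[1]))  # bank starts empty
    outputs = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Keys and values cover past memory slots plus the current segment only,
        # so no future frame is ever needed.
        context = np.concatenate([memory, segment], axis=0)
        q = segment @ w_q
        k = context @ w_k
        v = context @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_k))
        outputs.append(weights @ v)
        # ASSUMPTION: summarize the segment as the mean of its frames and
        # append that summary to the bank as one new memory slot.
        slot = segment.mean(axis=0, keepdims=True)
        memory = np.concatenate([memory, slot], axis=0)
    return np.concatenate(outputs, axis=0), memory


rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))
w_q, w_k, w_v = (rng.standard_normal((4, 4)) for _ in range(3))
outputs, memory = augmented_memory_attention(frames, 4, w_q, w_k, w_v)
print(outputs.shape, memory.shape)  # (10, 4) (3, 4)
```

Under these assumptions, each segment attends to at most `segment_len` frames plus one slot per earlier segment, rather than to every frame in the sequence, and outputs for earlier segments do not change when later frames arrive.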



Topics

Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Speech Recognition
Deep Learning > Learning Types > Deep Learning

Keywords

acoustic modeling, acoustic model, streaming mode, augmented memory, transformer acoustic model, input sequence length

