2020 INTERSPEECH INTERSPEECH 2020

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR

Abstract

Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between high-level representation of source inputs and embedding of target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that learns the alignment between different levels of acoustic abstraction and its corresponding linguistic information simultaneously in a shared embedding space. The ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which are the best reported results on this task to the best of our knowledge.

🌉 Interdisciplinary Bridge — Deep Learning and Speech & Audio
🧭 Keyword Pioneer — attention decoder
🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Machine Learning, Natural Language Processing, Speech & Audio
🐣 Hot Topic Early Bird — transformer decoder