Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR

Xinyuan Zhou; Grandee Lee; Emre Yılmaz; Yanhua Long; Jiaen Liang; Haizhou Li

INTERSPEECH 2020

Abstract

The Transformer has shown impressive performance in automatic speech recognition (ASR). It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representation of the source inputs and the embedding of the target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based large vocabulary continuous speech recognition (LVCSR). Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that simultaneously learns the alignment between different levels of acoustic abstraction and their corresponding linguistic information in a shared embedding space. ASR experiments on Aishell-1 show that the proposed structure achieves character error rates (CERs) of 4.8% on the dev set and 5.1% on the test set, which are, to the best of our knowledge, the best reported results on this task.
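The abstract does not spell out how the mixed attention is computed. As a rough illustration only, the sketch below shows one plausible reading of "a shared embedding space": each target token attends over the concatenation of the acoustic frames and the tokens decoded so far. The function name `mixed_attention`, the single head, the absence of learned projections, and the causal mask on the linguistic part are all assumptions made for this example, not details taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_attention(acoustic, linguistic):
    """Single-head scaled dot-product attention over [acoustic; linguistic].

    acoustic:   list of T vectors (acoustic frames), each of dimension d, T >= 1
    linguistic: list of U vectors (target-token embeddings), each of dimension d
    Returns (outputs, weights): one output vector and one weight row per token.
    Assumption: queries, keys and values are the raw vectors (no learned projections).
    """
    T, d = len(acoustic), len(acoustic[0])
    memory = acoustic + linguistic  # shared sequence of length T + U
    outputs, weights = [], []
    for u, query in enumerate(linguistic):
        scores = []
        for j, key in enumerate(memory):
            # Assumed masking: every acoustic frame is visible, but only
            # linguistic positions up to and including u are visible.
            visible = j < T or (j - T) <= u
            if visible:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(d))
            else:
                scores.append(float("-inf"))
        w = softmax(scores)
        out = [sum(w[j] * memory[j][i] for j in range(len(memory))) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example: 3 acoustic frames and 2 target tokens, all of dimension 2.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
linguistic = [[0.5, 0.5], [1.0, -1.0]]
outputs, weights = mixed_attention(acoustic, linguistic)
```

Each row of `weights` is a single distribution that spans both acoustic frames and earlier tokens, which is the sense in which the two kinds of information are aligned "simultaneously" in this sketch. A conventional Transformer decoder would instead use two separate attention blocks, one over the tokens and one over the encoder output.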

Authors

Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Model Architecture
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition

Keywords

self-attention mechanism, attention mechanism, automatic speech recognition, transformer decoder, large vocabulary speech recognition, transformer-based speech recognition, attention decoder, acoustic structure, mixed attention, deep acoustic structure

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020