Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation

Sheng Li; Dabre Raj; Xugang Lu; Peng Shen; Tatsuya Kawahara; Hisashi Kawai

INTERSPEECH 2019

Abstract

The end-to-end (E2E) model allows automatic speech recognition (ASR) systems to be trained without considering the acoustic model, lexicon, language model, and complicated decoding algorithms that are integral to conventional ASR systems. Recently, the transformer-based E2E ASR model (ASR-Transformer) has shown promising results in many speech recognition tasks. The most common practice is to stack a number of feed-forward layers in the encoder and decoder, and adding new layers improves speech recognition performance significantly. However, this also leads to a large increase in the number of parameters and to severe decoding latency. In this paper, we propose to reduce model complexity by simply reusing parameters across all stacked layers instead of introducing new parameters per layer. To address the slight reduction in recognition quality, we propose to augment the speech inputs with bags-of-attributes. As a result, we obtain a highly compressed, efficient, and high-quality ASR model.
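The parameter reuse described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it models only a position-wise feed-forward sublayer with a residual connection, omits attention and normalization, and every name and size in it (`make_layer`, `build_stack`, `d_model=8`, `d_ff=32`, six layers) is a hypothetical choice made for illustration. It shows that a stack reusing one layer keeps the same depth of computation while storing the parameters of a single layer.

```python
import random


def make_layer(d_model, d_ff, rng):
    """Create one feed-forward layer as two plain weight matrices (hypothetical layout)."""
    return {
        "w1": [[rng.gauss(0.0, 0.1) for _ in range(d_ff)] for _ in range(d_model)],
        "w2": [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_ff)],
    }


def matvec(x, w):
    """Multiply a length-n vector by an n-by-m matrix, giving a length-m vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def apply_layer(layer, x):
    """ReLU feed-forward transform followed by a residual connection."""
    hidden = [max(0.0, v) for v in matvec(x, layer["w1"])]
    out = matvec(hidden, layer["w2"])
    return [a + b for a, b in zip(x, out)]


def build_stack(n_layers, d_model, d_ff, share, seed=0):
    """Build a stack; with share=True every position refers to the same layer object."""
    rng = random.Random(seed)
    if share:
        layer = make_layer(d_model, d_ff, rng)
        return [layer] * n_layers
    return [make_layer(d_model, d_ff, rng) for _ in range(n_layers)]


def count_unique_params(stack):
    """Count weights, counting a layer object only once even if it is reused."""
    unique = {id(layer): layer for layer in stack}
    return sum(len(w) * len(w[0]) for layer in unique.values() for w in layer.values())


def forward(stack, x):
    """Apply the layers in order; a shared stack applies the same weights repeatedly."""
    for layer in stack:
        x = apply_layer(layer, x)
    return x


# Assumed toy sizes: 6 layers, model width 8, inner width 32.
unshared = build_stack(6, 8, 32, share=False)
shared = build_stack(6, 8, 32, share=True)
print(count_unique_params(unshared), count_unique_params(shared))
```

With these toy sizes each layer holds 8 × 32 + 32 × 8 = 512 weights, so the unshared stack stores six times as many parameters as the shared one, while both apply six layers of computation in `forward`.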

Topics

Machine Learning > Application Areas > Data Augmentation
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

model compression, automatic speech recognition, parameter sharing, parameter efficiency, end-to-end model, feed-forward layer, speech augmentation, transformer-based ASR, speech attributes augmentation
