INTERSPEECH 2019

Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation

Abstract

The end-to-end (E2E) model allows automatic speech recognition (ASR) systems to be trained without separately handling the acoustic model, lexicon, language model, and complicated decoding algorithms that are integral to conventional ASR systems. Recently, the transformer-based E2E ASR model (ASR-Transformer) has shown promising results in many speech recognition tasks. The most common practice is to stack a number of feed-forward layers in the encoder and decoder, and adding new layers improves speech recognition performance significantly. However, this also leads to a large increase in the number of parameters and severe decoding latency. In this paper, we propose to reduce the model complexity by simply reusing parameters across all stacked layers instead of introducing new parameters per layer. To address the resulting slight reduction in recognition quality, we propose to augment the speech inputs with bags-of-attributes. As a result, we obtain a highly compressed, efficient, and high-quality ASR model.
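
The parameter-reuse idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy layer below is a plain feed-forward transform standing in for a full transformer layer (which would also include self-attention, residual connections, and normalization), and the names `FeedForwardLayer`, `build_stack`, `run_stack`, and `count_unique_params` are hypothetical. It only shows how applying one layer repeatedly keeps the parameter count constant as depth grows.

```python
import random


class FeedForwardLayer:
    """Toy feed-forward layer: y = relu(W x + b). Hypothetical stand-in
    for a full transformer layer; not the paper's architecture."""

    def __init__(self, dim, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        # Matrix-vector product plus bias, followed by ReLU.
        return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.weight, self.bias)]

    def num_params(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def build_stack(dim, depth, share, seed=0):
    """Build a stack of `depth` layers.

    share=False: every layer owns its own parameters (conventional stacking).
    share=True:  one layer object is reused at every depth (parameter reuse).
    """
    rng = random.Random(seed)
    if share:
        layer = FeedForwardLayer(dim, rng)
        return [layer] * depth
    return [FeedForwardLayer(dim, rng) for _ in range(depth)]


def run_stack(stack, x):
    """Apply the layers in order; the amount of computation is the same
    whether or not the parameters are shared."""
    for layer in stack:
        x = layer(x)
    return x


def count_unique_params(stack):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer.num_params() for layer in stack}
    return sum(unique.values())


if __name__ == "__main__":
    dim, depth = 8, 6
    separate = build_stack(dim, depth, share=False)
    shared = build_stack(dim, depth, share=True)
    print("separate parameters:", count_unique_params(separate))
    print("shared parameters:  ", count_unique_params(shared))
    print("output size:", len(run_stack(shared, [1.0] * dim)))
```

In this toy setting the shared stack stores a single layer's parameters regardless of depth, while the conventional stack grows linearly with the number of layers. The sketch demonstrates only the reduction in stored parameters; it makes no claim about recognition quality or decoding latency.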
