Self-Distillation for Improving CTC-Transformer-Based ASR Systems

Takafumi Moriya; Tsubasa Ochiai; Shigeki Karita; Hiroshi Sato; Tomohiro Tanaka; Takanori Ashihara; Ryo Masumura; Yusuke Shinohara; Marc Delcroix

INTERSPEECH 2020

Abstract

We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. S2S models have been used successfully by the automatic speech recognition (ASR) community. A key factor in S2S models is the attention mechanism, which captures the relationships between input and output sequences. The attention weights indicate which time frames should be attended to when predicting each output label. In previous work, we proposed distilling S2S knowledge into connectionist temporal classification (CTC) based models by using the attention characteristics to create pseudo-targets for an auxiliary cross-entropy loss term. This approach can significantly improve CTC models. However, it remained unclear whether our proposal could be used to improve S2S models. In this paper, we extend our previous work to create a strong S2S model, i.e., a Transformer with CTC (CTC-Transformer). We use the Transformer outputs and the source attention weights to make pseudo-targets that contain both the posterior and the timing information of each Transformer output. These pseudo-targets are used to train the shared encoder of the CTC-Transformer with direct feedback from the Transformer decoder, so that the encoder obtains more informative representations. Experiments on public and private datasets covering various tasks demonstrate that our proposal is also effective for enhancing S2S model training. In particular, on a Japanese ASR task, our best system outperforms the previous state-of-the-art alternative.
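The abstract does not give the exact formula for the pseudo-targets, so the following is only a minimal sketch of one plausible reading: each decoder posterior is spread over the encoder frames according to its source attention weights, and each frame is then renormalized into a distribution that serves as the target of an auxiliary cross-entropy term on the encoder side. The function names, the per-frame renormalization, and the uniform fallback for frames that receive no attention are all assumptions made for illustration, not details taken from the paper.

```python
def make_pseudo_targets(attn, post, eps=1e-8):
    """Spread decoder posteriors over encoder frames using attention weights.

    attn: L x T nested list, attn[l][t] = attention of output step l on frame t.
    post: L x V nested list, post[l][v] = decoder posterior of label v at step l.
    Returns a T x V nested list whose rows each sum to 1.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    num_steps, num_frames = len(attn), len(attn[0])
    vocab = len(post[0])
    targets = []
    for t in range(num_frames):
        # Attention-weighted sum of the posteriors of all output steps.
        row = [
            sum(attn[l][t] * post[l][v] for l in range(num_steps))
            for v in range(vocab)
        ]
        total = sum(row)
        if total < eps:
            # Assumption: a frame with no attention gets a uniform target.
            row = [1.0 / vocab] * vocab
        else:
            row = [x / total for x in row]
        targets.append(row)
    return targets


def auxiliary_cross_entropy(targets, frame_log_probs):
    """Mean over frames of -sum_v q[t][v] * log p_enc[t][v]."""
    total = 0.0
    for q_row, lp_row in zip(targets, frame_log_probs):
        total -= sum(q * lp for q, lp in zip(q_row, lp_row))
    return total / len(targets)


# Toy example: 2 output labels, 3 encoder frames, vocabulary of size 2.
attn = [[0.9, 0.1, 0.0],
        [0.0, 0.2, 0.8]]
post = [[0.7, 0.3],
        [0.1, 0.9]]
pseudo_targets = make_pseudo_targets(attn, post)
```

In this toy case the first frame is attended almost only by the first output step, so its pseudo-target stays close to that step's posterior, while the middle frame mixes both steps. In a real system the auxiliary loss would presumably be added to the usual CTC and attention losses with some interpolation weight; that weight is not specified in the abstract.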

Topics

Machine Learning > Application Areas > Knowledge Distillation
Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Automatic Speech Recognition
Deep Learning > Techniques > Knowledge Distillation
Deep Learning > Learning Types > Knowledge Distillation

Keywords

attention mechanism, knowledge distillation

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020