Self-Distillation for Improving CTC-Transformer-Based ASR Systems
Abstract
We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models, which have been used successfully by the automatic speech recognition (ASR) community. The key component of S2S models is the attention mechanism, which captures the relationships between input and output sequences: the attention weights indicate which time frames should be attended to when predicting each output label. In previous work, we proposed distilling S2S knowledge into connectionist temporal classification (CTC) based models by using the attention characteristics to create pseudo-targets for an auxiliary cross-entropy loss term. This approach can significantly improve CTC models. However, it remained unclear whether it could also be used to improve S2S models. In this paper, we extend our previous work to build a strong S2S model, namely a Transformer with CTC (CTC-Transformer). We use the Transformer outputs and the source attention weights to make pseudo-targets that contain both the posterior and the timing information of each Transformer output. These pseudo-targets are used to train the shared encoder of the CTC-Transformer with direct feedback from the Transformer decoder, so that the encoder obtains more informative representations. Experiments on public and private datasets covering various tasks demonstrate that our proposal is also effective for enhancing S2S model training. In particular, on a Japanese ASR task, our best system outperforms the previous state-of-the-art alternative.
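
As a rough illustration of the pseudo-target idea described above, the following Python sketch combines decoder posteriors with source attention weights to obtain frame-level targets, then scores hypothetical encoder outputs with a cross-entropy term. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the weighted-sum rule Q[t][v] = sum_u attn[u][t] * post[u][v], the function names `make_pseudo_targets` and `auxiliary_cross_entropy`, the normalization by the number of output tokens, and the omission of the CTC blank label. It should not be read as the authors' actual method.

```python
import math


def make_pseudo_targets(attn, post):
    """Spread decoder posteriors over encoder frames using attention weights.

    attn: U x T source-attention weights (row u sums to 1 over the T frames).
    post: U x V decoder posteriors (row u sums to 1 over the V labels).
    Returns a T x V matrix Q with Q[t][v] = sum_u attn[u][t] * post[u][v].
    This weighted-sum rule is an assumption, not taken from the paper.
    """
    num_tokens, num_frames, num_labels = len(attn), len(attn[0]), len(post[0])
    return [
        [
            sum(attn[u][t] * post[u][v] for u in range(num_tokens))
            for v in range(num_labels)
        ]
        for t in range(num_frames)
    ]


def auxiliary_cross_entropy(pseudo_targets, encoder_log_probs, num_tokens):
    """Cross-entropy between pseudo-targets and frame-level encoder outputs.

    Dividing by the number of output tokens is an assumed normalization.
    """
    total = 0.0
    for q_row, logp_row in zip(pseudo_targets, encoder_log_probs):
        for q, logp in zip(q_row, logp_row):
            total -= q * logp
    return total / num_tokens


# Toy example: 2 output tokens, 3 encoder frames, 2 labels.
attn = [[0.9, 0.1, 0.0],
        [0.0, 0.2, 0.8]]
post = [[0.7, 0.3],
        [0.1, 0.9]]
pseudo_targets = make_pseudo_targets(attn, post)

# Hypothetical encoder output: uniform over the two labels at every frame.
encoder_log_probs = [[math.log(0.5)] * 2 for _ in range(3)]
loss = auxiliary_cross_entropy(pseudo_targets, encoder_log_probs, num_tokens=2)
```

In this toy case, frames that receive most of a token's attention also receive most of that token's posterior mass, which is one way a target can carry both posterior and timing information. In a real system this term would be added to the usual CTC and decoder losses with some weighting, which the sketch leaves out.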