2023 INTERSPEECH INTERSPEECH 2023

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Abstract

End-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and mask-predict models. There are pros and cons to each of these architectures, and thus practitioners may switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four different decoders (CTC, attention, RNN-T, mask-predict) share an encoder - we refer to this as 4D modeling. Additionally, we propose to 1) train 4D models using a two-stage strategy which stabilizes multitask learning and 2) decode 4D models using a novel time-synchronous one-pass beam search. We demonstrate that jointly trained 4D models improve the performances of each individual decoder. Further, we show that our joint CTC/RNN-T/attention decoding surpasses the previously proposed CTC/attention decoding.

🌉 Interdisciplinary Bridge — Machine Learning and Speech & Audio
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio