Real-Time End-to-End Monaural Multi-Speaker Speech Recognition

Song Li; Beibei Ouyang; Fuchuan Tong; Dexin Liao; Lin Li; Qingyang Hong

2021 INTERSPEECH INTERSPEECH 2021

Real-Time End-to-End Monaural Multi-Speaker Speech Recognition

Abstract

The rising interest in single-channel multi-speaker speech separation has triggered the development of end-to-end multi-speaker automatic speech recognition (ASR). However, until now, most systems have adopted autoregressive mechanisms for decoding, resulting in slow decoding speed, which is not conducive to the application of multi-speaker speech recognition in real-world environments. In this paper, we first comprehensively investigate and compare the mainstream end-to-end multi-speaker speech recognition systems. Secondly, we improve the recently proposed non-autoregressive end-to-end speech recognition model Mask-CTC, and introduce it to multi-speaker speech recognition to achieve real-time decoding. Our experiments on the LibriMix data set show that under the premise of the same amount of parameters, the non-autoregressive model achieves performance close to that of the autoregressive model while having a faster decoding speed.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Speech & Audio

🧭 Keyword Pioneer — mask-ctc model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio

Authors

Song Li , Beibei Ouyang , Fuchuan Tong , Dexin Liao , Lin Li , Qingyang Hong

Topics

Machine Learning > Application Areas > Efficient Computing Deep Learning > Techniques > Model Architecture Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

speech separation real-time decoding non-autoregressive model multi-speaker speech recognition mask-ctc model

Download PDF

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021