INTERSPEECH 2024

Self-training ASR Guided by Unsupervised ASR Teacher

Abstract

Self-training has gained increasing attention due to its notable performance improvements in speech recognition. However, conventional self-training techniques have two key limitations: (1) a labeled dataset is required to train the teacher that produces pseudo-targets, and (2) the first teacher, trained on a small labeled dataset, tends to over-fit and generates noisy pseudo-targets for unseen datasets. Our approach adopts an unsupervised automatic speech recognition (UASR) model as the teacher, and thus uses only unlabeled data. Because the proposed model also learns phonetic information from the UASR teacher at the intermediate layer, the pseudo-target at the higher layer contains more ASR-related information than that of Data2vec2. Experimental results on LibriSpeech show that our model outperforms Data2vec2, the state-of-the-art self-supervised learning model, achieving 8.9% and 4.3% relative word error rate reduction on test-clean and test-other, respectively.
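
The relative word error rate (WER) reduction quoted above can be read as follows. This is a minimal sketch that assumes the usual definition (the drop in WER divided by the baseline WER); the abstract does not spell out the formula. The function name `relative_wer_reduction` and the numbers in the example are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumed definition: 100 * (baseline - proposed) / baseline.
    Both arguments are WERs expressed in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs, chosen only to illustrate the arithmetic;
# they are NOT the paper's baseline or proposed-model numbers.
example = relative_wer_reduction(baseline_wer=10.0, proposed_wer=9.0)
print(f"{example:.1f}% relative WER reduction")
```

Under this definition, a relative reduction is always measured against the baseline system's WER, so the same absolute improvement yields a larger relative figure when the baseline WER is already low.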
