Self-training ASR Guided by Unsupervised ASR Teacher

Hyung Yong Kim; Byeong-Yeol Kim; Yunkyu Lim; Jihwan Park; Shukjae Choi; Yooncheol Ju; Jinseok Park; Youshin Lim; Seung Woo Yu; Hanbin Lee; Shinji Watanabe

INTERSPEECH 2024

Abstract

Self-training has gained increasing attention due to its notable performance improvement in speech recognition. However, conventional self-training techniques have two key limitations: (1) a labeled dataset is required to train the teacher that produces the pseudo-targets, and (2) the first teacher, trained on a small labeled dataset, tends to over-fit and therefore generates noisy pseudo-targets for unseen datasets. Our approach adopts an unsupervised automatic speech recognition (UASR) model as the teacher, and thus uses only unlabeled data. Because the proposed model also learns phonetic information from the UASR teacher at an intermediate layer, the pseudo-target at the higher layer contains more ASR-related information than that of Data2vec2. Experimental results on LibriSpeech show that our model outperforms Data2vec2, the state-of-the-art self-supervised learning model, achieving 8.9% and 4.3% relative word error rate reductions on test-clean and test-other, respectively.
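The relative word error rate reductions quoted in the abstract follow the usual convention of dividing the absolute WER improvement by the baseline WER. The sketch below shows that arithmetic only. The function name `relative_wer_reduction` and the example WER values are hypothetical and chosen for illustration; they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent.

    Both arguments are word error rates expressed in the same unit
    (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), used only to illustrate the calculation.
# They are NOT taken from the paper.
example = relative_wer_reduction(baseline_wer=5.0, new_wer=4.5)
print(f"{example:.1f}% relative reduction")
```

Under this convention, a small absolute gain on an already low baseline WER can still correspond to a sizeable relative reduction, which is why relative figures are commonly reported alongside absolute ones.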

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Semi-Supervised Learning
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

automatic speech recognition, phonetic information, unsupervised teacher, pseudo-target generation, intermediate layer
