What happens in continued pre-training? Analysis of self-supervised speech models with continued pre-training for colloquial Finnish ASR

Yaroslav Getman; Tamás Grósz; Mikko Kurimo

INTERSPEECH 2024

Abstract

The advancement of self-supervised learning has enabled the rapid development of highly accurate speech recognition models, such as wav2vec 2.0, for many languages. While high-resourced languages like English benefit from purely monolingual models, other, less-resourced ones must build upon multilingual foundations. In this work, we investigate various strategies to specialize models for the colloquial Finnish language and demonstrate that continued pre-training of available multilingual models is the best solution. Furthermore, we investigate the success of the pre-training procedure by examining the learned quantized representations and show how the continued pre-training improved the discovered latent codeword groups.
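One simple way to inspect learned quantized representations is to check how evenly a model uses its codebook. The sketch below is only an illustration under stated assumptions, not the authors' analysis procedure: the helper `codeword_usage` and the toy index sequences are hypothetical, and in practice the indices would come from the quantizer of a wav2vec 2.0 model.

```python
import math
from collections import Counter


def codeword_usage(indices, num_codewords):
    """Summarize codebook usage for a sequence of codeword indices.

    Returns (coverage, perplexity): coverage is the fraction of the
    codebook that appears at least once, and perplexity is exp(entropy)
    of the empirical usage distribution (higher means more even usage).
    """
    counts = Counter(indices)
    total = len(indices)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return len(counts) / num_codewords, math.exp(entropy)


# Hypothetical toy data: frame-level codeword indices from a 4-entry codebook.
skewed = [0, 0, 0, 0, 1, 1, 0, 0]   # usage concentrated on two codewords
even = [0, 1, 2, 3, 0, 1, 2, 3]     # usage spread over all four codewords

print(codeword_usage(skewed, 4))
print(codeword_usage(even, 4))
```

Comparing these statistics between a model before and after continued pre-training gives a rough indication of whether codebook usage changed; it says nothing by itself about whether the codeword groups are linguistically meaningful.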

Topics

Machine Learning > Learning Types > Self-Supervised Learning; Deep Learning > Techniques > Pretraining; Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

speech recognition; continued pre-training; wav2vec 2.0; self-supervised speech model; quantized representation
