All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale

Vasista Sai Lodagala; Abhishek Biswas; Shoutrik Das; Jordan F; S Umesh

INTERSPEECH 2024

Abstract

The abundance of unlabeled speech and its ease of collection call for the development of self-supervised learning (SSL) based speech foundation models, which have been effective across several downstream speech tasks. As part of this work, we curate 29.5K hours of raw speech data across 24 Indian languages and multiple domains to pre-train SSL models over 5 different architectures. We then fine-tune these models for the downstream Automatic Speech Recognition (ASR) task on 13 Indian languages and evaluate them over diverse benchmarks. In addition, we measure the efficacy of these models by evaluating them over the SUPERB benchmark. Our work underscores the need for a careful choice of SSL objectives while emphasizing the benefits of multilingual pre-training. On Indian language ASR, our pre-trained models outperform the baseline models MMS-300M and IndicWav2Vec, with relative WER improvements of 17.3% and 36.0%, respectively.
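The abstract reports results as relative WER improvements. As a minimal sketch of how such a figure is conventionally computed (not the authors' evaluation code), the Python below derives word error rate from a word-level edit distance and then the relative improvement over a baseline. The function names and the example sentences and WER values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_wer_improvement(baseline_wer=20.0, new_wer=15.0))
```

Under this definition a relative improvement is measured against the baseline's own WER, so the same absolute gain counts for more when the baseline WER is low.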

Authors

Vasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan F, S Umesh

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

self-supervised learning, automatic speech recognition, language identification, multilingual pretraining, acoustic representation, speech foundation model
