INTERSPEECH 2024

All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale

Abstract

The abundance of unlabeled speech and its ease of collection call for the development of self-supervised learning (SSL) based speech foundation models, which have been effective across several downstream speech tasks. As part of this work, we curate 29.5K hours of raw speech data across 24 Indian languages and multiple domains to pre-train SSL models over 5 different architectures. We then fine-tune these models for the downstream Automatic Speech Recognition (ASR) task on 13 Indian languages and evaluate them over diverse benchmarks. In addition, we measure the efficacy of these models by evaluating them on the SUPERB benchmark. Our work underscores the need for a careful choice of SSL objectives and highlights the benefits of multilingual pre-training. On Indian language ASR, our pre-trained models outperform the baseline models MMS-300M and IndicWav2Vec, with relative WER improvements of 17.3% and 36.0%, respectively.
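The relative WER improvements quoted above follow the usual convention of expressing the drop in word error rate as a fraction of the baseline's WER. The sketch below is only an illustration of that arithmetic, not the authors' evaluation code: the function names `word_error_rate` and `relative_wer_improvement` are hypothetical, whitespace tokenization is an assumption, and the numbers in the usage lines are made up for the example rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, model_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# Hypothetical values, for illustration only.
example_wer = word_error_rate("a b c d", "a x c")
example_gain = relative_wer_improvement(baseline_wer=20.0, model_wer=15.0)
```

Under this convention, a relative improvement is always measured against the baseline's error rate, so the same absolute WER drop yields a larger relative figure when the baseline is already strong.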
