INTERSPEECH 2021

One Size Does Not Fit All in Resource-Constrained ASR

Abstract

The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases in word error rates, enabling the use of this technology for interacting with smartphones and personal home assistants in high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for the vast majority of the world’s languages. In this paper, we investigate the utility of three distinct architectures that have previously been used for ASR in languages with limited training resources. We train and test these systems on publicly available ASR datasets for several typologically and orthographically diverse languages. These datasets were produced under a variety of conditions using different speech collection strategies, practices, and equipment. Although these corpora are comparable in size, we find that no single ASR architecture outperforms all others. In addition, word error rates vary significantly, in some cases falling within the range typically reported for high-resource languages. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for languages with limited training resources.
