One Size Does Not Fit All in Resource-Constrained ASR

Ethan Morris; Robbie Jimerson; Emily Prud’hommeaux

INTERSPEECH 2021

Abstract

The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases in ASR word error rates, enabling the use of this technology for interacting with smartphones and personal home assistants in high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for the vast majority of the world’s languages. In this paper, we investigate the utility of three distinct architectures that have previously been used for ASR in languages with limited training resources. We train and test these systems on publicly available ASR datasets for several typologically and orthographically diverse languages, which were produced under a variety of conditions using different speech collection strategies, practices, and equipment. Although these corpora are comparable in size, we find that no single ASR architecture outperforms all others. In addition, word error rates vary significantly, in some cases falling within the range of those typically reported for high-resource languages. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for languages with limited training resources.
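The abstract compares systems by word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the authors' scoring setup, which the abstract does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Real scoring pipelines
    usually also normalize case and punctuation, which is omitted here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when comparing rates across corpora.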

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition

Keywords

speech recognition, automatic speech recognition, acoustic modeling, low-resource language, deep neural network, word error rate

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021