ASR2K: Speech Recognition for Around 2000 Languages without Audio

Xinjian Li; Florian Metze; David R. Mortensen; Alan W Black; Shinji Watanabe

INTERSPEECH 2022

Abstract

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics for that language. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models are multilingual models used without any supervision in the target language. The language model is built from n-gram statistics or from the raw text dataset. We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only, and improve them to 45% CER and 69% WER when using only 10,000 raw text utterances.
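The results above are reported as character error rate (CER) and word error rate (WER). As a minimal sketch of how such metrics are conventionally computed, the snippet below sums the Levenshtein edit distance between each reference and hypothesis and divides by the total reference length. The function names `edit_distance` and `error_rate` are hypothetical, and the choice to count spaces as characters for CER is an assumption; the paper's exact scoring setup is not given here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Corpus-level error rate: total edits / total reference length.

    unit="char" gives CER (assumption: spaces count as characters);
    unit="word" gives WER over whitespace-separated tokens.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("references are empty")
    return total_edits / total_len


if __name__ == "__main__":
    refs = ["the cat sat"]
    hyps = ["the cat sit"]
    print("CER:", error_rate(refs, hyps, unit="char"))
    print("WER:", error_rate(refs, hyps, unit="word"))
```

Because one wrong character makes the whole word wrong, WER is typically much higher than CER on the same output, which matches the gap between the two figures reported above.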

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition
Machine Learning > Learning Types > Multi-Modal Learning

Keywords

zero-shot learning, speech recognition, automatic speech recognition, acoustic model, low-resource language, language model, multilingual model, n-gram statistics
