2020
INTERSPEECH
INTERSPEECH 2020
MLS: A Large-Scale Multilingual Dataset for Speech Research
Abstract
This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 32K hours of English and a total of 4.5K hours for other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LM) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org
🌉
Interdisciplinary Bridge
— Natural Language Processing and Speech & Audio
🧭
Keyword Pioneer
— speech research
🐝
Cross-Pollinator
— Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Machine Learning, Natural Language Processing, Speech & Audio