MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap; Qiantong Xu; Anuroop Sriram; Gabriel Synnaeve; Ronan Collobert

INTERSPEECH 2020

Abstract

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages, including about 32K hours of English and a total of 4.5K hours for the other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LMs) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.

Topics

Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Synthesis > Text-to-Speech

Keywords

language modeling, automatic speech recognition, multilingual dataset, speech corpus, speech research, audiobook corpus
