VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Changhan Wang; Morgane Riviere; Ann Lee; Anne Wu; Chaitanya Talnikar; Daniel Haziza; Mary Williamson; Juan Pino; Emmanuel Dupoux

ACL 2021

Abstract

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages, totaling 17.3K hours. We provide automatic speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Semi-Supervised Learning
Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Recognition > Automatic Speech Recognition
Deep Learning > Learning Types > Semi-Supervised Learning
Speech & Audio > Recognition > Speech Translation

Keywords

unsupervised learning, representation learning, semi-supervised learning, speech recognition, automatic speech recognition, multilingual speech, speech-to-text translation, speech corpus
