INTERSPEECH 2023

VoxTube: a multilingual speaker recognition dataset

Abstract

The objective of this paper is to advance speaker recognition and speaker identification technologies by introducing VoxTube, a large labeled audio database collected from open-source media. We propose a fully automated, unsupervised approach to audio labeling that requires a pre-trained speaker recognition model of any kind. Collected with this approach from videos under a CC BY license, the VoxTube dataset contains more than 5,000 speakers and more than 4 million utterances spoken in more than 10 languages. We show VoxTube's high generalization ability across multiple domains by evaluating accuracy metrics on various speaker recognition benchmarks. We also show how well this dataset complements the existing VoxCeleb2 dataset.
