BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus

Josh Meyer; David Adelani; Edresson Casanova; Alp Öktem; Daniel Whitenack; Julian Weber; Salomon Kabongo Kabenamualu; Elizabeth Salesky; Iroro Orife; Colin Leong; Perez Ogayo; Chris Chinenye Emezue; Jonathan Mukiibi; Salomey Osei; Apelete Agbolo; Victor Akinode; Bernard Opoku; Olanrewaju Samuel; Jesujoba Alabi; Shamsuddeen Hassan Muhammad

INTERSPEECH 2022

Abstract

BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio-quality, 48 kHz, single-speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. This corpus is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have aligned, cleaned, and filtered the original recordings, and additionally hand-checked a subset of the alignments for each language. We present results for text-to-speech models with Coqui TTS. The data is released under a commercial-friendly CC-BY-SA license.
