2024 COLING COLING 2024

Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages

Abstract

AbstractPrevious efforts to collect Filipino speech were done in the development of Filipino-Speech Corpus, TAGCO, and Filipino-Bisaya speech corpus. These corpora, however, are either domain-specific, non-parallel, non-multilingual or relatively insufficient for the development of state-of-the-art Automatic Speech Recognizers (ASR) and Text-To-Speech Systems (TTS) which usually requires hundreds of hours of speech data. This paper presents a multilingual corpora for the Philippine languages namely: Filipino, English, Cebuano, Kapampangan, Hiligaynon, Ilokano, Bikolano, Waray, and Tausug. PLD includes over 454 hours of recordings from speakers of the ten languages, covering multiple domains in news, medical, education, tourism and spontaneous speech. The applicability of the corpus has also been demonstrated in adult and children ASR, phoneme transcriber, voice conversion, and TTS applications.

๐ŸŒ‰ Interdisciplinary Bridge โ€” Natural Language Processing and Speech & Audio
๐Ÿ Cross-Pollinator โ€” Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio