Open Source Speech and Language Resources for Frisian

Emre Yılmaz; Henk van den Heuvel; Jelske Dijkstra; Hans Van de Velde; Frederik Kampstra; Jouke Algra; David Van Leeuwen

INTERSPEECH 2016

Abstract

In this paper, we present several open source speech and language resources for the under-resourced Frisian language. Frisian is mostly spoken in the province of Fryslân, which is located in the north of the Netherlands. The native speakers of Frisian are Frisian-Dutch bilinguals and often code-switch in daily conversations. The resources presented in this paper include a code-switching speech database containing radio broadcasts, a phonetic lexicon with more than 70k words, and a language model trained on a text corpus with more than 38M words. With this contribution, we aim to share the Frisian resources we have collected in the scope of the FAME! project, in which a spoken document retrieval system is built for the disclosure of the regional broadcaster’s radio archives. These resources enable research on code-switching and on longitudinal speech and language change. Moreover, a sample automatic speech recognition (ASR) recipe for the Kaldi toolkit will be provided online to facilitate Frisian ASR research.
