Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech

Martijn Bentum; Eric Sanders; Antal P.J. van den Bosch; Douwe Zeldenrust; Henk van den Heuvel

COLING 2024

Abstract

The Dutch Dialect Database (also known as the ‘Nederlandse Dialectenbank’) contains recordings of dialectal varieties of Dutch made throughout the Netherlands in the second half of the twentieth century. A subset of about 300 hours of these recordings was enriched with manual orthographic transcriptions that use non-standard spellings to approximate the dialectal speech. In this paper, we describe the creation of a corpus containing both the audio recordings and their corresponding transcriptions, focusing on our method for aligning the recordings with the transcriptions and the metadata.