Searchable Language Documentation Corpora: DoReCo meets TEITOK

Maarten Janssen; Frank Seifart

ACL 2025

Abstract

In this paper, we describe a newly created searchable interface for DoReCo, a database that contains spoken corpora from a worldwide sample of 53 mostly lesser-described languages, with audio, transcription, translation, and, for most languages, interlinear morpheme glosses. Until now, DoReCo data were available for download in a number of different formats via the DoReCo website and the Nakala repository, but they were not directly accessible online. We created a graphical interface to view, listen to, and search these data online, providing direct and intuitive access for linguists and laypeople. The new interface uses the TEITOK corpus infrastructure to provide a number of different visualizations of individual documents in DoReCo, along with a search interface for detailed searches within individual languages. The use of TEITOK also makes the corpus usable with NLP pipelines, either to train NLP models on the data or to enrich the data using NLP models.
