ACL 2024
Towards a Clean Text Corpus for Ottoman Turkish
Abstract
Ottoman Turkish, as a historical variant of modern Turkish, suffers from a scarcity of available corpora and NLP models. This paper outlines our pioneering efforts to address this gap by constructing a clean text corpus of Ottoman Turkish materials. We detail the challenges encountered in this process and offer potential solutions. Additionally, we present a case study in which the created corpus is used for continual pre-training of BERTurk, followed by an evaluation of the model’s performance on the named entity recognition task for Ottoman Turkish. Preliminary experimental results suggest that our corpus is effective for adapting existing models developed for modern Turkish to historical Turkish.
Topics
Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Understanding > Named Entity Recognition
Natural Language Processing > Resources & Methods > Multilingual NLP
Interdisciplinary > Linguistics > Computational Linguistics
Natural Language Processing > Resources & Methods > Transfer Learning
Artificial Intelligence > Core AI > Language