ACL 2025

Variety delights (sometimes) - Annotation differences in morphologically annotated corpora

Abstract

The goal of annotation standards is to ensure consistency across different corpora and languages. But do they succeed? In this paper, we experiment with morphologically annotated Hungarian corpora of different sizes (ELTE DH gold standard corpus, NYTK-NerKor, and Szeged Treebank) to assess their compatibility as a merged training corpus for morphological analysis and disambiguation. Our results show that combining any two corpora not only failed to improve the results of the trained tagger but even degraded them due to the inconsistent annotations. Further analysis of the annotation differences among the corpora revealed inconsistencies stemming from several sources: different theoretical approaches, lack of consensus, and tagset conversion issues.
