Validating Label Consistency in NER Data Annotation

Qingkai Zeng; Mengxia Yu; Wenhao Yu; Tianwen Jiang; Meng Jiang

EMNLP 2021

Abstract

Data annotation plays a crucial role in ensuring that named entity recognition (NER) models are trained on the right information. Producing accurate labels is challenging because of the complexity of annotation. Label inconsistency between multiple subsets of annotated data (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate label consistency (or catch inconsistency) across multiple sets of NER data annotation. In experiments, our method identified label inconsistency in the test data of the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It validated the consistency in the corrected version of both datasets.
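To make the idea of label inconsistency between subsets concrete, here is a minimal sketch. It is not the paper's method, which relates consistency to NER model performance and is not detailed in this abstract. It is a simple surface-level heuristic that flags mentions annotated with different entity types in two subsets. The function name `conflicting_mentions`, the `(mention_text, entity_type)` input format, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def conflicting_mentions(subset_a, subset_b):
    """Return mentions whose label sets differ between two annotated subsets.

    Hypothetical helper: each subset is an iterable of
    (mention_text, entity_type) pairs. Mentions are compared
    case-insensitively.
    """
    def labels_by_mention(pairs):
        table = defaultdict(set)
        for text, label in pairs:
            table[text.lower()].add(label)
        return table

    a = labels_by_mention(subset_a)
    b = labels_by_mention(subset_b)
    # Only mentions that occur in both subsets can be compared.
    return {m: (a[m], b[m]) for m in a.keys() & b.keys() if a[m] != b[m]}


# Toy, made-up annotations (not taken from SCIERC or CoNLL03).
train = [("Washington", "LOC"), ("EMNLP", "MISC")]
test = [("Washington", "PER"), ("EMNLP", "MISC")]
conflicts = conflicting_mentions(train, test)
```

A flagged mention is only a candidate for review: a surface form such as "Washington" can legitimately carry different types in different contexts, which is one reason a model-based validation like the one proposed here may be preferable to string matching alone.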

Topics

Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Understanding > Named Entity Recognition
Natural Language Processing > Applications > Named Entity Recognition
Machine Learning > Learning Types > Evaluation

Keywords

named entity recognition, data annotation, data quality, annotation quality, model performance, label consistency, label mistake
