MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition

Shervin Malmasi; Anjie Fang; Besnik Fetahu; Sudipta Kar; Oleg Rokhlenko

2022 COLING COLING 2022

MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition

Abstract

AbstractWe present AnonData, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We tested the performance of two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art NER GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET, which uses gazetteers, improvement significantly (average improvement of macro-F1=+30%) and demonstrates the difficulty of our dataset. AnonData poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — low-context scenario

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Shervin Malmasi , Anjie Fang , Besnik Fetahu , Sudipta Kar , Oleg Rokhlenko

Topics

Natural Language Processing > Resources & Methods > Multilingual NLP Machine Learning > Learning Types > Supervised Learning Natural Language Processing > Applications > Named Entity Recognition Machine Learning > Learning Types > Classification Deep Learning > Models > Transformers

Keywords

multilingual nlp named entity recognition fine-grained recognition multilingual dataset low-context scenario

Download PDF

Related papers

MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation 2022

The Role of Context and Uncertainty in Shallow Discourse Parsing 2022

SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training 2022

Complicate Then Simplify: A Novel Way to Explore Pre-trained Models for Text Classification 2022

Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories 2022