ParsTwiNER: A Corpus for Named Entity Recognition at Informal Persian

MohammadMahdi Aghajani; AliAkbar Badri; Hamid Beigy

EMNLP 2021

Abstract

Because of unstructured sentences, misspellings, and other errors, finding named entities in a noisy environment such as social media takes much more effort. ParsTwiNER contains about 250k tokens gathered from Persian Twitter and annotated following standard guidelines such as MUC-6 or CoNLL 2003. Inter-annotator agreement, measured with Cohen’s Kappa coefficient, is 0.95, a high score. In this study, we demonstrate that some state-of-the-art models degrade on these corpora, and we train a new model using parallel transfer learning based on the BERT architecture. Experimental results show that the model works well in informal Persian as well as in formal Persian.
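For readers unfamiliar with the agreement measure cited in the abstract, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration of that formula on token-level labels. It is not the authors' evaluation script; the function name `cohens_kappa`, the toy label sequences, and the tag names are assumptions made for this example only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length label sequences (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of tokens given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Both annotators used one identical label everywhere: treat as full agreement.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1 - p_e)


# Toy data with illustrative tags; these are not taken from the ParsTwiNER corpus.
annotator_a = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "B-ORG", "O", "O"]
annotator_b = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O", "O", "O"]

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.3f}")
```

Because the `O` tag dominates NER annotation, chance agreement is high, which is why Kappa is usually reported in place of raw agreement.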

Topics

Deep Learning > Architectures > Transformers
Natural Language Processing > Understanding > Named Entity Recognition
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Types > Transfer Learning
Natural Language Processing > Applications > Named Entity Recognition
Natural Language Processing > Resources & Methods > Transfer Learning
Deep Learning > Techniques > Transfer Learning

Keywords

transfer learning, named entity recognition, Twitter data, social media text, annotation consistency, BERT architecture, informal text, informal Persian, parallel transfer learning
