Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Emanuela Boros; Ahmed Hamdi; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; José G. Moreno; Nicolas Sidere; Antoine Doucet

EMNLP 2020

Abstract

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
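The challenge named in the abstract, OCR misspellings breaking entity recognition, can be illustrated with a toy sketch. This is not the paper's model or data: the confusion table `OCR_CONFUSIONS`, the helper names `add_ocr_noise` and `gazetteer_tag`, and the example sentence are all hypothetical, and the exact-match gazetteer stands in for any recognizer that relies on surface forms.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative only; it is not taken from the paper.
OCR_CONFUSIONS = {"l": "1", "O": "0", "e": "c", "m": "rn", "S": "5"}


def add_ocr_noise(text, rate, seed=0):
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def gazetteer_tag(tokens, gazetteer):
    """Mark a token as an entity only when it exactly matches a known name."""
    return [tok in gazetteer for tok in tokens]


# Assumed example data: a tiny gazetteer and one French sentence.
gazetteer = {"Strasbourg", "Berlin"}
clean = "Le train de Strasbourg arrive à Berlin".split()

# rate=1.0 corrupts every confusable character, so the result is deterministic.
noisy = [add_ocr_noise(tok, rate=1.0) for tok in clean]

print(sum(gazetteer_tag(clean, gazetteer)), "entities found in clean text")
print(sum(gazetteer_tag(noisy, gazetteer)), "entities found in noisy text")
```

Under these assumptions, a single character substitution is enough for a surface-form lookup to miss an entity entirely, which is why a model for digitized text needs to be robust to such noise.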

Topics

Deep Learning > Architectures > Transformers; Natural Language Processing > Understanding > Named Entity Recognition; Interdisciplinary > Science > Digital Humanities; Natural Language Processing > Applications > Named Entity Recognition; Deep Learning > Learning Types > Transfer Learning

Keywords

transfer learning; named entity recognition; optical character recognition; historical document; digitization error; transformer model
