Unsupervised Multi-View Post-OCR Error Correction With Language Models

Harsh Gupta; Luciano Del Corro; Samuel Broscheit; Johannes Hoffart; Eliot Brenner

EMNLP 2021

Abstract

We investigate post-OCR correction in a setting where we have access to different OCR views of the same document. The goal of this study is to determine whether a pretrained language model (LM) can be used in an unsupervised way to reconcile the different OCR views such that their combination contains fewer errors than each individual view. This approach is motivated by scenarios in which unconstrained text generation for error correction is too risky. We evaluated different pretrained LMs on two datasets and found significant gains in realistic scenarios, with up to 15% WER improvement over the best OCR view. We also show the importance of domain adaptation for post-OCR correction on out-of-domain documents.
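Since the headline result is reported as word error rate (WER), a minimal sketch of that metric may help when reading the numbers. It assumes the standard definition (word-level edit distance divided by the number of reference words) and whitespace tokenization; the function name `word_error_rate` and the sample strings are illustrative and not taken from the paper, which may tokenize or normalize differently. The abstract does not say whether the 15% figure is relative or absolute, so the sketch makes no assumption about that.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two OCR views of the same line, each with its own errors.
reference = "the quick brown fox"
view_a = "the qu1ck brown fox"
view_b = "the quick brovvn f0x"

wer_a = word_error_rate(reference, view_a)
wer_b = word_error_rate(reference, view_b)
```

In this toy case the views err on different words, which is the situation in which reconciling them could yield a combination with fewer errors than either view alone.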

Topics

Machine Learning > Learning Types > Unsupervised Learning
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Text Processing

Keywords

unsupervised learning, domain adaptation, multi-view learning, language model, error correction, word error rate, post-OCR correction, text correction, OCR error correction
