EMNLP 2020
OCR Post Correction for Endangered Language Texts
Abstract
There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
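To make the reported figure concrete, here is a minimal sketch of how a recognition error rate and its relative reduction might be computed. The abstract does not say which metric is used, so character error rate (edit distance divided by reference length) is an assumption here, and the function names `edit_distance`, `char_error_rate`, and `relative_reduction` are hypothetical, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Assumed metric: character edits needed, normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by post-correction."""
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the paper.
    reference = "endangered language"
    ocr_output = "endangcred languagc"
    corrected = "endangered languagc"

    cer_ocr = char_error_rate(reference, ocr_output)
    cer_corrected = char_error_rate(reference, corrected)
    print(f"CER before: {cer_ocr:.3f}, after: {cer_corrected:.3f}")
    print(f"relative reduction: {relative_reduction(cer_ocr, cer_corrected):.0%}")
```

Under this reading, a 34% reduction means the post-corrected error rate is about two thirds of the first-pass OCR error rate, averaged over the three languages.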
Authors
Topics
Machine Learning > Learning Types > Unsupervised Learning
Machine Learning > Application Areas > Data Augmentation
Machine Learning > Application Areas > Domain Adaptation
Computer Science > Applications > Document Analysis
Computer Vision > Processing > Image Processing
Natural Language Processing > Applications > Document Analysis
Machine Learning > Learning Types > Low-Resource Learning