PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Shuhao Guan; Moule Lin; Cheng Xu; Xinyi Liu; Jinman Zhao; Jiexin Fan; Qi Xu; Derek Greene

ACL 2025

Abstract

This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
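The headline result is a relative reduction in character error rate (CER). The abstract does not spell out the exact CER definition used, so the following is only a minimal sketch under a common assumption: CER is the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length. The function names and the toy strings are illustrative and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters reaches the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, assumed here to be edit distance / reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed, e.g. 0.639 for a 63.9% reduction."""
    return (baseline - improved) / baseline


# Toy strings for illustration only; these are not data from the paper.
reference = "the old town"
raw_ocr = "tbe o1d tovvn"    # hypothetical OCR output on a degraded page
corrected = "the old tovn"   # hypothetical output after restoration and correction

baseline_cer = cer(reference, raw_ocr)
pipeline_cer = cer(reference, corrected)
print(f"raw CER: {baseline_cer:.3f}")
print(f"pipeline CER: {pipeline_cer:.3f}")
print(f"relative reduction: {relative_reduction(baseline_cer, pipeline_cer):.1%}")
```

In this toy case the raw output is 4 edits away from a 12-character reference and the corrected output is 1 edit away, so the relative reduction is 75%. The reported 63.9-70.3% figures are reductions of this relative kind, measured against OCR on the raw images.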

Topics

Computer Vision > Processing > Image Restoration
Computer Vision > Processing > Video Processing
Natural Language Processing > Applications > Text Generation
Computer Vision > Domain-Specific > Document Analysis
Deep Learning > Models > Sequence-to-Sequence

Keywords

image synthesis; image processing; optical character recognition; document image restoration; historical document; post-OCR correction; text extraction; text correction; historical document analysis; text-to-speech model
