WACV 2026

CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts

Abstract

We present CURIO, an OCR system for low-resource historical manuscripts. In many challenging cases, manuscripts feature curved text lines, unsegmented lines that lack spacing between words, and highly variable line lengths, conditions under which existing OCR methods fail. To tackle this challenge, we first extract lines and their corresponding curvature profiles from manuscripts, then straighten them with a rectification procedure that reduces redundant background within each line. Because real data is scarce, we complement it with synthetic data. To bridge the synthetic-real gap, we generate synthetic line images by warping rendered straight text along the rectified profiles, so that real and synthetic lines align in their curvature characteristics. Our recognizer is a lightweight CNN-Transformer with padding-aware null activations and sparse attention, optimized with CTC loss for efficient training. We evaluate our method on challenging manuscript collections written in Sharada, a rare and endangered Indic script. CURIO outperforms strong CNN+RNN and Transformer baselines, with the largest gains on high-curvature lines and long lines. CURIO further transfers zero-shot to printed Sharada text, indicating robustness beyond the manuscript domain.
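The warping and straightening steps can be illustrated with a minimal sketch. It assumes, purely for illustration, that a curvature profile is one vertical offset (in pixels) per image column and that warping is a simple column-wise shift; the abstract does not specify the actual procedure, so the function names `warp_along_profile` and `straighten` and the grayscale list-of-rows image format are hypothetical, not CURIO's implementation.

```python
def _normalized_offsets(profile):
    """Round offsets to whole pixels and shift them so the smallest is 0."""
    offsets = [round(p) for p in profile]
    base = min(offsets)
    return [o - base for o in offsets]


def warp_along_profile(line_img, profile, background=255):
    """Bend a straight line image by shifting each column down by its offset.

    line_img is a list of rows of grayscale values; profile holds one
    vertical offset per column (an assumed representation).
    """
    height, width = len(line_img), len(line_img[0])
    if len(profile) != width:
        raise ValueError("profile needs one offset per image column")
    offsets = _normalized_offsets(profile)
    # The output is taller by the largest shift; unused pixels stay background.
    out = [[background] * width for _ in range(height + max(offsets))]
    for x, dy in enumerate(offsets):
        for y in range(height):
            out[y + dy][x] = line_img[y][x]
    return out


def straighten(curved_img, profile, line_height):
    """Inverse of the warp: pull each column back up by its offset."""
    offsets = _normalized_offsets(profile)
    width = len(curved_img[0])
    return [
        [curved_img[y + offsets[x]][x] for x in range(width)]
        for y in range(line_height)
    ]


# Toy example: a 2x4 "rendered" line bent along a small arc.
rendered = [[1, 2, 3, 4],
            [5, 6, 7, 8]]
arc = [0, 1, 2, 1]
curved = warp_along_profile(rendered, arc)
recovered = straighten(curved, arc, line_height=2)
```

In this toy setting, straightening with the same profile recovers the rendered line exactly and discards the background rows the bend introduced, which mirrors the stated goal of reducing redundant background within each line. A practical version would likely use an array library and sub-pixel interpolation rather than whole-pixel shifts.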
