TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Minghao Li; Tengchao Lv; Jingye Chen; Lei Cui; Yijuan Lu; Dinei Florencio; Cha Zhang; Zhoujun Li; Furu Wei

AAAI 2023

Abstract

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built on CNNs for image understanding and RNNs for char-level text generation. In addition, a separate language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose TrOCR, an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
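To make the contrast between char-level and wordpiece-level generation concrete, the following is a minimal sketch of greedy longest-match-first wordpiece splitting. The function name, the toy vocabulary, and the "##" continuation marker (borrowed from BERT-style tokenizers) are all assumptions for illustration; the abstract does not specify which tokenizer TrOCR uses.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup.

    Illustrative only: the "##" prefix for non-initial pieces is a
    BERT-style convention assumed here, not taken from the TrOCR abstract.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary covers this position.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"recog", "##ni", "##tion", "text", "##s"}

word = "recognition"
pieces = wordpiece_tokenize(word, TOY_VOCAB)  # wordpiece-level targets
chars = list(word)                            # char-level targets
print(len(pieces), "wordpieces vs", len(chars), "characters")
```

With this toy vocabulary, a decoder that emits wordpieces needs far fewer generation steps for the same word than one that emits single characters, which is one plausible reading of why wordpiece-level output is attractive for a text Transformer decoder.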

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Natural Language Processing > Applications > Text Generation
Computer Vision > Domain-Specific > Document Analysis
Computer Vision > Processing > Image Processing
Deep Learning > Models > Transformers
Computer Vision > Applications > Document Analysis

Keywords

transformer architecture, pre-trained model, image understanding, optical character recognition, text recognition, image transformer, text transformer, pre trained model, transformer based model, wordpiece level text generation
