Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Haoyu Cao; Changcun Bao; Chaohu Liu; Huang Chen; Kun Yin; Hao Liu; Yinsong Liu; Deqiang Jiang; Xing Sun

ICCV 2023

Abstract

We propose SeRum (SElective Region Understanding Model), a novel end-to-end model for extracting meaningful information from document images, with applications in document analysis, retrieval, and office automation. Unlike state-of-the-art approaches, which rely on multi-stage pipelines and are computationally expensive, SeRum recasts document image understanding and recognition as a local decoding process over the vision tokens of interest, using a content-aware token merge module. This mechanism lets the model concentrate on the regions of interest generated by the query decoder, which improves effectiveness and speeds up decoding in the generative scheme. We also design several pre-training tasks to strengthen the model's understanding and local awareness. Experimental results show that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advance toward efficient and effective end-to-end document understanding.
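The abstract does not spell out how the content-aware token merge works, so the following is only a minimal sketch of the general idea: keep the vision tokens that score highest for relevance and fold the rest into them. Everything here is an assumption made for illustration, not the authors' implementation: the function names `cosine` and `merge_tokens`, the use of cosine similarity to pick a merge target, plain averaging as the merge rule, and the toy scores standing in for whatever the query decoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, scores, keep):
    """Hypothetical merge rule: keep the `keep` highest-scoring tokens and
    average every other token into its most similar kept token.

    Returns (kept_indices, merged_tokens), both in original token order.
    """
    # Rank token indices by relevance score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])

    # Running sums and counts so each kept token becomes a mean of its group.
    sums = {i: list(tokens[i]) for i in kept}
    counts = {i: 1 for i in kept}

    for j in order[keep:]:
        # Assign each low-scoring token to the kept token it resembles most.
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[j])]
        counts[target] += 1

    merged = [[s / counts[i] for s in sums[i]] for i in kept]
    return kept, merged


# Toy input: four 2-D "vision tokens" with made-up relevance scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 3.0]]
scores = [0.9, 0.8, 0.1, 0.2]
kept, merged = merge_tokens(tokens, scores, keep=2)
print(kept, merged)
```

In this toy run the decoder would attend to two tokens instead of four, which is the kind of sequence shortening the abstract credits for faster decoding; the actual selection and merge rules in SeRum may differ.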

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Efficient Computing

Keywords

document understanding, end-to-end model, vision token, text spotting, selective region concentration, content-aware token merge

Related papers

PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework 2023

Periodically Exchange Teacher-Student for Source-Free Object Detection 2023

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations 2023

Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles 2023

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation 2023