Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Haoli Bai; Zhiguang Liu; Xiaojun Meng; Li Wentao; Shuang Liu; Yifeng Luo; Nian Xie; Rongfu Zheng; Liangwei Wang; Lu Hou; Jiansheng Wei; Xin Jiang; Qun Liu

ACL 2023

Abstract

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While existing solutions study various vision-language pre-training objectives, the document textline, an intrinsic granularity in VDU, has seldom been explored. Document textlines are easily obtained from OCR engines, and each usually contains words that are spatially and semantically correlated. In this paper, we propose Wukong-Reader, trained with new pre-training objectives that leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are designed to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader brings superior performance on various VDU tasks in both English and Chinese. The fine-grained alignment over textlines also gives Wukong-Reader promising localization ability.
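The abstract names textline-region contrastive learning but does not spell out the objective. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over paired region and text embeddings, one pair per textline. It is an assumption about the general family of technique, not the paper's formulation: the function names, the temperature value, and the use of one pooled embedding per textline are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def textline_region_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over N (region, text) pairs.

    Assumption: region_embs[i] and text_embs[i] describe the same textline,
    and every other pairing in the batch is treated as a negative.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    # Cosine similarity of every region against every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
           for r in regions]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-region direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

With toy embeddings, correctly paired inputs such as `[[1, 0], [0, 1]]` on both sides should give a loss near zero, while swapping the text order should give a much larger loss.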

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Contrastive Learning
Computer Vision > Core AI > Multimodal Learning
Computer Vision > Domain-Specific > Document Analysis
Natural Language Processing > Applications > Document Analysis
Deep Learning > Models > Multi-Modal Learning
Computer Vision > Applications > Document Analysis

Keywords

multi-modal learning, vision-language model, visual document understanding, optical character recognition, fine-grained alignment, multi-modal pre-training, textline-region contrastive learning, document textline, masked region modeling

Related papers

History Semantic Graph Enhanced Conversational KBQA with Temporal Information Modeling (2023)

Efficient Transformers with Dynamic Token Pooling (2023)

HHU at SemEval-2023 Task 3: An Adapter-based Approach for News Genre Classification (2023)

NAP at SemEval-2023 Task 3: Is Less Really More? (Back-)Translation as Data Augmentation Strategies for Detecting Persuasion Techniques (2023)
