DocFormer: End-to-End Transformer for Document Understanding

Srikar Appalaraju; Bhavan Jasani; Bhargava Urala Kota; Yusheng Xie; R. Manmatha

ICCV 2021

Abstract

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks that encourage multi-modal interaction. It uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
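The idea of sharing spatial embeddings across modalities can be illustrated with a small sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes one visual feature per text token, adds the same spatial embedding to both modalities before a plain single-head self-attention, and fuses the two outputs by summation. All names here (`self_attention`, `shared_spatial_attention`, the `params` layout) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Plain single-head scaled dot-product self-attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def shared_spatial_attention(text_feats, visual_feats, spatial_emb, params):
    # Assumption: the SAME spatial embedding is added to both modalities,
    # so position information is common to text and visual tokens.
    text_out = self_attention(text_feats + spatial_emb, *params["text"])
    visual_out = self_attention(visual_feats + spatial_emb, *params["visual"])
    # Assumption: fuse the two modality outputs by simple summation.
    return text_out + visual_out

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8

text_feats = rng.normal(size=(n_tokens, dim))
visual_feats = rng.normal(size=(n_tokens, dim))
spatial_emb = rng.normal(size=(n_tokens, dim))  # stands in for learned embeddings

# Separate projection weights per modality (randomly initialised here).
params = {
    name: tuple(rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
    for name in ("text", "visual")
}

fused = shared_spatial_attention(text_feats, visual_feats, spatial_emb, params)
print(fused.shape)  # (6, 8)
```

Because both branches receive the same `spatial_emb`, a text token and the visual feature at the same location carry identical position information, which is the intuition behind correlating text with visual tokens.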

Authors

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Topics

Deep Learning > Architectures > Transformers
Computer Science > Applications > Document Analysis
Computer Vision > Domain-Specific > Document Analysis
Natural Language Processing > Resources & Methods > Multimodal NLP

Keywords

unsupervised pretraining, multi-modal learning, document analysis, visual document understanding, spatial embedding, text recognition, multi-modal transformer, multi-modal self-attention
