GRAM: Global Reasoning for Multi-Page VQA

Tsachi Blau; Sharon Fogel; Roi Ronen; Alona Golts; Roy Ganz; Elad Ben Avraham; Aviad Aberdam; Shahar Tsiper; Ron Litman

CVPR 2024

Abstract

The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To encourage the model to use the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (CFormer), which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
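The interleaving of page-level and document-level processing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that a copy of the learnable document tokens is prepended to each page, that page-level layers attend within a single page, and that document-level layers attend only over the document tokens gathered from all pages. The abstract does not spell out these details. The names `self_attention` and `encode_document` are hypothetical, and the attention used here has no learned weights.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Unparameterized scaled dot-product self-attention.

    A toy stand-in for a transformer layer (assumption: no learned weights).
    """
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return out


def encode_document(pages, num_doc_tokens=2, num_blocks=2, seed=0):
    """Alternate page-level and document-level attention (hypothetical sketch).

    pages: list of pages, each a list of token vectors of equal dimension.
    Returns one sequence per page: doc tokens first, then page tokens.
    """
    rng = random.Random(seed)
    dim = len(pages[0][0])
    # Stand-in for learnable document tokens: randomly initialised, never trained here.
    doc_init = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_doc_tokens)]
    # Each page gets its own copy of the document tokens prepended.
    seqs = [[list(t) for t in doc_init] + [list(t) for t in page] for page in pages]
    for _ in range(num_blocks):
        # Page-level step: every page is processed independently.
        seqs = [self_attention(seq) for seq in seqs]
        # Document-level step: only doc tokens from all pages exchange information.
        all_doc = [tok for seq in seqs for tok in seq[:num_doc_tokens]]
        mixed = self_attention(all_doc)
        for p, seq in enumerate(seqs):
            seq[:num_doc_tokens] = mixed[p * num_doc_tokens:(p + 1) * num_doc_tokens]
    return seqs
```

In this sketch the cost of the page-level step grows with the length of each page separately, while the only cross-page interaction runs over the short list of document tokens. After at least two blocks, the ordinary tokens of one page can be influenced by the content of another page through those tokens.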

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Architectures > Transformers; Natural Language Processing > Applications > Machine Reading Comprehension; Natural Language Processing > Applications > Visual Question Answering; Computer Vision > Applications > Question Answering

Keywords

visual question answering; information extraction; document understanding; visual understanding; multi-page document; document visual question answering; global reasoning
