VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka; Taichi Iki; Taku Hasegawa; Kyosuke Nishida; Kuniko Saito; Jun Suzuki

2025 CVPR CVPR 2025

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Abstract

We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ryota Tanaka , Taichi Iki , Taku Hasegawa , Kyosuke Nishida , Kuniko Saito , Jun Suzuki

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Information Retrieval Natural Language Processing > Applications > Question Answering Computer Vision > Domain-Specific > Document Analysis Natural Language Processing > Generation > Retrieval-Augmented Generation Deep Learning > Learning Types > Multi-Modal Learning

Keywords

visual question answering multimodal learning document understanding dense retrieval vision-language model retrieval-augmented generation self-supervised pre-training document visual question answering visually-rich document

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025