Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Hao Guo; Xugong Qin; Jun Jie Ou Yang; Peng Zhang; Gangyan Zeng; Yubo Li; Hailun Lin

2025 CVPR CVPR 2025

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Abstract

Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — document image retrieval

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Hao Guo , Xugong Qin , Jun Jie Ou Yang , Peng Zhang , Gangyan Zeng , Yubo Li , Hailun Lin

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning Natural Language Processing > Applications > Information Retrieval Computer Vision > Domain-Specific > Document Analysis Deep Learning > Learning Types > Contrastive Learning Natural Language Processing > Resources & Methods > Multimodal NLP

Keywords

zero-shot learning information retrieval benchmark dataset visual document understanding natural language query document image retrieval contrastive vision language model fine-grained semantic query

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025