SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models

Thong Nguyen; Yibin Lei; Jia-Huei Ju; Andrew Yates

EMNLP 2025

Abstract

Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision–language model first produces a detailed textual description of each document image, which is then embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialized multi-vector visual document encoder, and it scales similarly on MIRACL-VISION with broader multilingual coverage. Analysis shows that modern vision–language models capture complex textual and visual cues with sufficient granularity to act as a reusable semantic proxy. By off-loading modality alignment to pretrained vision–language models, our approach removes the need for computationally intensive text-image contrastive training and establishes a strong zero-shot baseline for future VDR systems.
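
The generate-and-encode idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vision–language model is replaced by a hypothetical `fake_vlm_describe` function that returns canned descriptions, and the text encoder is replaced by a toy bag-of-words vector (`embed_text`). All function names, page identifiers, and descriptions below are assumptions made for illustration; a real system would presumably call an actual vision–language model and a dense text encoder at those two points.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a vision-language model: canned descriptions
# keyed by page id. A real pipeline would generate these from the images.
canned = {
    "page_1": "A bar chart of quarterly revenue by region with a legend and axis labels.",
    "page_2": "A table listing employee names, departments, and start dates.",
    "page_3": "A scanned invoice with line items, totals, and a company logo.",
}

def fake_vlm_describe(image_id: str) -> str:
    """Return a textual description for a document image (assumed stub)."""
    return canned[image_id]

def embed_text(text: str) -> Counter:
    """Toy text encoder: lowercase term counts instead of a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(image_ids: List[str],
                describe: Callable[[str], str]) -> Dict[str, Counter]:
    """Offline step: describe each document image, then encode the description."""
    return {image_id: embed_text(describe(image_id)) for image_id in image_ids}

def search(query: str, index: Dict[str, Counter], k: int = 5) -> List[Tuple[str, float]]:
    """Online step: encode the text query and rank documents by similarity."""
    q = embed_text(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = build_index(list(canned), fake_vlm_describe)
results = search("quarterly revenue chart", index, k=2)
```

The point of the sketch is the division of labor: images are only seen by the description step at indexing time, so queries and documents are compared purely in text space and no text-image contrastive training is involved.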

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning
Computer Vision > Core AI > Multimodal Learning
Computer Vision > Domain-Specific > Document Analysis
Deep Learning > Models > Vision-Language Models
Computer Vision > Applications > Document Analysis

Keywords

zero-shot learning; image retrieval; semantic embedding; vision-language model; text embedding; image-text retrieval; text encoder; text-to-image retrieval; visual document retrieval
