ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Ahmed Masry; Megh Thakkar; Patrice Bechard; Sathwik Tejaswi Madhusudhan; Rabiul Awal; Shambhavi Mishra; Akshay Kalkunte Suresh; Srivatsava Daruru; Enamul Hoque; Spandana Gella; Torsten Scholak; Sai Rajeswar

EMNLP 2025

Abstract

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
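
For readers unfamiliar with late interaction, the sketch below shows the standard ColBERT-style MaxSim score that such retrievers typically start from: each query token embedding is matched to its most similar document token embedding, and the maxima are summed. This is an illustrative baseline only, not ColMate's own scoring mechanism, which the abstract describes as adapted to document structure without giving details. The function names (`maxsim_score`, `rank_documents`) and the toy 2-dimensional embeddings are assumptions made for this example.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Generic late interaction (MaxSim) score, not ColMate's variant.

    For every query token embedding, take the highest cosine similarity
    against all document token (or image patch) embeddings, then sum.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_documents(query_tokens, docs):
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_tokens, toks) for doc_id, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # has an exact match per query token
    "page_b": [[1.0, 1.0]],                          # only a partial match for each
}
ranking = rank_documents(query, docs)
```

In this toy case `page_a` ranks first because each query token finds an exact match among its token embeddings, whereas `page_b` offers only one embedding that partially matches both.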

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Contrastive Learning

Keywords

contrastive learning, multimodal learning, document retrieval, late interaction, OCR pretraining
