CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information

Banafsheh Karimian; Giulia Avanzato; Soufiane Belharbi; Alexis Guichemerre; Luke McCaffrey; Mohammadhadi Shateri; Eric Granger

WACV 2026

Abstract

Multimodal learning has shown promise in medical imaging by combining complementary modalities such as images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference. Annotation cost, privacy constraints, and compute demands therefore limit their practicality. Unpaired external text, such as pathology reports, can still provide complementary diagnostic cues if semantically relevant content can be retrieved for each image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, which keeps overhead low while retaining the benefit of multimodal training and requires no paired data in the downstream dataset. Experiments show that CLIP-IT improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without requiring paired annotations per dataset or incurring additional inference-time complexity.
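
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that image and report embeddings have already been computed by a pre-trained CLIP model and that "most relevant" means highest cosine similarity; the abstract does not state the exact similarity measure. The function names `l2_normalize` and `retrieve_pseudo_pairs` and the toy random embeddings are hypothetical and are not taken from the paper. Distillation and LoRA adaptation are not covered here.

```python
from typing import Tuple

import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def retrieve_pseudo_pairs(
    image_emb: np.ndarray, text_emb: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """For each image, return the index and similarity of the closest report.

    Assumption: relevance is measured by cosine similarity in the shared
    CLIP embedding space.
    """
    sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (n_images, n_reports)
    best = sim.argmax(axis=1)
    return best, sim[np.arange(sim.shape[0]), best]


# Toy data standing in for CLIP encoder outputs (hypothetical, for illustration).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))   # 5 unpaired reports, 16-dim embeddings
true_idx = np.array([3, 0, 4])        # report each toy image was derived from
# Each image embedding is a slightly perturbed copy of one report embedding.
image_emb = text_emb[true_idx] + 0.01 * rng.normal(size=(3, 16))

pair_idx, pair_sim = retrieve_pseudo_pairs(image_emb, text_emb)
print(pair_idx, pair_sim)
```

In this sketch each image is paired with exactly one report. The resulting index array would then select the text embeddings used as privileged information during training, while inference would use the vision model alone.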
