WACV 2025

LLM-Generated Rewrite and Context Modulation for Enhanced Vision Language Models in Digital Pathology

Abstract

Recent advancements in vision-language models (VLMs) have found important applications in medical imaging, particularly in digital pathology. VLMs demand large-scale datasets of image-caption pairs, which are often hard to obtain in medical domains. State-of-the-art VLMs in digital pathology have been pre-trained on datasets that are significantly smaller than their computer vision counterparts. Furthermore, the caption of a pathology slide often refers to a small subset of features in the image, an important point that is ignored in existing VLM pre-training schemes. Another important and under-appreciated issue is that the performance of state-of-the-art VLMs in zero-shot classification tasks can be sensitive to the choice of prompts. In this paper, we first employ language rewrites using a large language model (LLM) to enrich a public pathology image-caption dataset, and we make the enriched dataset publicly available. Our extensive experiments demonstrate that training with language rewrites boosts the performance of a state-of-the-art digital pathology VLM on downstream tasks such as zero-shot classification, text-to-image retrieval, and image-to-text retrieval. We further leverage LLMs to demonstrate the sensitivity of zero-shot classification results to the choice of prompts and propose a scalable approach to characterize this sensitivity when comparing models. Finally, we present a novel context modulation layer that adjusts the image embeddings to better align with the paired text, and we use context-specific language rewrites to train this layer. Our results show that the proposed context modulation framework can yield further substantial performance gains.
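
To make the prompt-sensitivity point concrete, the following is a minimal sketch of how zero-shot accuracy could be measured across several prompt phrasings and summarized by its spread. It is an illustration only, not the authors' procedure: the function names are hypothetical, and random synthetic vectors stand in for the image and text embeddings that a real VLM's encoders would produce.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar.

    image_emb:      (n_images, dim)
    class_text_emb: (n_classes, dim), one embedded prompt per class
    """
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


def prompt_sensitivity(image_emb, labels, text_emb_per_prompt):
    """Accuracy for each prompt phrasing, plus summary statistics.

    text_emb_per_prompt: (n_prompts, n_classes, dim)
    """
    accs = np.array([
        (zero_shot_predict(image_emb, text_emb) == labels).mean()
        for text_emb in text_emb_per_prompt
    ])
    summary = {
        "mean": float(accs.mean()),
        "std": float(accs.std()),
        "min": float(accs.min()),
        "max": float(accs.max()),
    }
    return accs, summary


# Synthetic stand-ins (assumption): real embeddings would come from a VLM.
rng = np.random.default_rng(0)
n_classes, n_images, n_prompts, dim = 4, 200, 10, 32

centers = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=n_images)
image_emb = centers[labels] + 0.5 * rng.normal(size=(n_images, dim))

# Each "prompt" perturbs the class text embeddings differently,
# mimicking the effect of rephrasing the class description.
text_emb_per_prompt = centers[None] + 0.8 * rng.normal(
    size=(n_prompts, n_classes, dim)
)

accs, summary = prompt_sensitivity(image_emb, labels, text_emb_per_prompt)
print("accuracy per prompt:", np.round(accs, 3))
print("summary:", summary)
```

Reporting the spread (standard deviation, minimum, and maximum) alongside the mean is one simple way to compare models without depending on a single hand-picked prompt.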
