LLM-Generated Rewrite and Context Modulation for Enhanced Vision Language Models in Digital Pathology

Cagla Deniz Bahadir; Gozde B. Akar; Mert R. Sabuncu

WACV 2025

Abstract

Recent advancements in vision-language models (VLMs) have found important applications in medical imaging, particularly in digital pathology. VLMs demand large-scale datasets of image-caption pairs, which are often hard to obtain in medical domains. State-of-the-art VLMs in digital pathology have been pre-trained on datasets that are significantly smaller than their computer vision counterparts. Furthermore, the caption of a pathology slide often refers to a small subset of features in the image, an important point that is ignored in existing VLM pre-training schemes. Another important and under-appreciated issue is that the performance of state-of-the-art VLMs in zero-shot classification tasks can be sensitive to the choice of prompts. In this paper, we first employ language rewrites using a large language model (LLM) to enrich a public pathology image-caption dataset, and we make the enriched dataset publicly available. Our extensive experiments demonstrate that training with language rewrites boosts the performance of a state-of-the-art digital pathology VLM on downstream tasks such as zero-shot classification, text-to-image retrieval, and image-to-text retrieval. We further leverage LLMs to demonstrate the sensitivity of zero-shot classification results to the choice of prompts, and we propose a scalable approach to characterize this sensitivity when comparing models. Finally, we present a novel context modulation layer that adjusts the image embeddings to better align with the paired text, and we use context-specific language rewrites to train this layer. Our results show that the proposed context modulation framework can yield further substantial performance gains.
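
The prompt-sensitivity issue raised in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation protocol. It assumes that image and prompt embeddings are already available as plain vectors, that zero-shot classification picks the class whose prompt embedding has the highest cosine similarity with the image embedding, and that sensitivity is summarized by the spread of accuracy across prompt sets. All function names, the toy 2-D embeddings, and the three prompt sets are hypothetical.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose prompt embedding is most similar."""
    sims = [cosine(image_emb, text_emb) for text_emb in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy_for_prompt(image_embs, labels, class_text_embs):
    """Zero-shot accuracy for one prompt set (one text embedding per class)."""
    correct = sum(
        zero_shot_predict(x, class_text_embs) == y
        for x, y in zip(image_embs, labels)
    )
    return correct / len(labels)


def prompt_sensitivity(image_embs, labels, prompt_sets):
    """Summarize how much accuracy varies across alternative prompt sets."""
    accs = [accuracy_for_prompt(image_embs, labels, p) for p in prompt_sets]
    return {
        "accuracies": accs,
        "mean": statistics.mean(accs),
        "std": statistics.pstdev(accs),
        "min": min(accs),
        "max": max(accs),
    }


# Toy 2-D embeddings (hypothetical): two images per class.
IMAGE_EMBS = [[1.0, 0.05], [0.9, 0.3], [0.2, 1.0], [0.4, 0.8]]
LABELS = [0, 0, 1, 1]

# Each prompt set holds one text embedding per class (hypothetical values).
PROMPT_SETS = [
    [[1.0, 0.0], [0.0, 1.0]],  # prompt set A: classes well separated
    [[1.0, 0.5], [0.5, 1.0]],  # prompt set B: classes moderately separated
    [[1.0, 0.0], [1.0, 0.2]],  # prompt set C: classes nearly collapsed
]

report = prompt_sensitivity(IMAGE_EMBS, LABELS, PROMPT_SETS)
print(report)
```

Reporting the mean together with the minimum, maximum, and standard deviation over many prompt sets, rather than a single accuracy from one hand-picked prompt, is one simple way to compare models without the comparison depending on prompt choice. In practice the prompt sets could be generated by an LLM, as the abstract suggests, but that step is outside this sketch.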

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

vision language model; image-text retrieval; zero-shot classification; large language model; context modulation