Image-Caption Encoding for Improving Zero-Shot Generalization

Eric Yu; Christopher Liao; Sathvik Ravi; Theodoros Tsiligkaridis; Brian Kulis

WACV 2025

Abstract

Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) results on downstream inference tasks such as zero-shot image classification. However, a persistent weakness of these models in image classification is their limited out-of-distribution (OOD) generalization. We first show that when an OOD datapoint is misclassified, the correct class can typically be found among the Top-K predicted classes. To steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and by up to 3% on challenging datasets.
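The Top-K re-ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the image-conditioned and caption-conditioned predictions are combined, so the weighted sum below, the function name `rerank_top_k`, and the parameters `k` and `weight` are all assumptions made for illustration, not the authors' formulation. The inputs are assumed to be per-class probabilities that some vision-language model has already produced for the image and for its generated caption.

```python
from typing import Sequence


def rerank_top_k(
    image_probs: Sequence[float],
    caption_probs: Sequence[float],
    k: int = 5,
    weight: float = 0.5,
) -> int:
    """Pick a class from the image Top-K using caption scores as a tiebreaker.

    Hypothetical combination rule (an assumption, not taken from the paper):
    score(c) = image_probs[c] + weight * caption_probs[c], for c in Top-K.
    """
    if len(image_probs) != len(caption_probs):
        raise ValueError("both predictions must cover the same classes")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Candidate set: the K classes ranked highest by the image-conditioned prediction.
    classes = range(len(image_probs))
    top_k = sorted(classes, key=lambda c: image_probs[c], reverse=True)[:k]

    # Local search restricted to the candidates: classes outside the Top-K
    # cannot be selected, however strongly the caption favors them.
    return max(top_k, key=lambda c: image_probs[c] + weight * caption_probs[c])


# Toy example with four classes: the image prediction narrowly prefers class 0,
# while the caption prediction clearly prefers class 1.
image_probs = [0.40, 0.35, 0.15, 0.10]
caption_probs = [0.10, 0.60, 0.05, 0.25]
chosen = rerank_top_k(image_probs, caption_probs, k=2, weight=0.5)
```

In this toy example the caption evidence moves the prediction from class 0 to class 1, both of which lie in the image Top-2. Setting `weight` to 0 recovers the plain image-conditioned Top-1 prediction, and because the search is confined to the Top-K, a caption that favors a class outside that set leaves the choice among the original candidates.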

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Transfer Learning
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning
Machine Learning > Learning Types > Zero-Shot Learning
Machine Learning > Learning Types > Domain Adaptation
Machine Learning > Application Areas > Domain Generalization
Deep Learning > Learning Types > Zero-Shot Learning

Keywords

contrastive learning; image captioning; out-of-distribution generalization; vision-language model; zero-shot classification; top-k prediction; image caption encoding
