Retaining and Enhancing Pre-Trained Knowledge in Vision-Language Models with Prompt Ensembling

Donggeun Kim; Yujin Jo; Myungjoo Lee; Taesup Kim

WACV 2025

Abstract

The advancement of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model, has revolutionized the field of machine learning by enabling robust zero-shot learning capabilities. These capabilities allow models to understand and respond to previously unseen data without task-specific training. However, adapting CLIP to integrate specialized knowledge from various domains while retaining its zero-shot capabilities remains a significant challenge. To address this, we introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method aims to enhance CLIP's zero-shot capabilities by incorporating new domain knowledge while improving its adaptability and robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention, to optimize CLIP's adaptability while safeguarding its zero-shot capabilities; auxiliary prompts, to integrate new domain insights without disrupting the original model's representation; and an ensemble learning strategy that effectively merges original and new knowledge. Through rigorous experimentation, including more challenging cross-dataset transfer evaluations, our GPE method redefines the benchmarks for the adaptability and efficiency of vision-language models, surpassing existing models across various scenarios.
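The abstract names prompt grouping with masked attention and an ensemble over groups but gives no implementation details. The following is a minimal sketch of one plausible reading, not the authors' code: it assumes that prompt tokens attend only to tokens in their own group (a block-diagonal mask) and that the ensemble averages per-group class probabilities. The function names `group_attention_mask` and `ensemble_probs`, the group sizes, and the logit values are all hypothetical and chosen for illustration only.

```python
import numpy as np


def group_attention_mask(group_sizes):
    """Return a boolean mask where entry [i, j] is True when prompt
    token i may attend to prompt token j.

    Assumption: tokens attend only within their own group, which
    yields a block-diagonal mask.
    """
    group_ids = np.repeat(np.arange(len(group_sizes)), group_sizes)
    return group_ids[:, None] == group_ids[None, :]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def ensemble_probs(logits_per_group):
    """Average class probabilities over prompt groups.

    logits_per_group has shape (num_groups, num_classes).
    Assumption: a plain, unweighted average; the paper may weight
    or combine groups differently.
    """
    logits = np.asarray(logits_per_group, dtype=float)
    return softmax(logits, axis=-1).mean(axis=0)


# Two hypothetical prompt groups with 2 and 3 tokens.
mask = group_attention_mask([2, 3])

# Hypothetical image-text similarity logits for 3 classes, one row per group.
probs = ensemble_probs([[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]])
predicted_class = int(probs.argmax())
```

In this sketch the mask keeps each group's prompts isolated from the others, which is one way a group could preserve the pre-trained representation while other groups adapt to new data; the averaging step then combines the predictions of all groups.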

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning, domain adaptation, vision-language model, prompt ensemble, contrastive language-image pre-training
