Concept-Guided Prompt Learning for Generalization in Vision-Language Models

Yi Zhang; Ce Zhang; Ke Yu; Yushun Tang; Zhihai He

AAAI 2024

Abstract

The Contrastive Language-Image Pretraining (CLIP) model has proven highly effective at establishing cross-modal connections between text and images, yielding strong performance across a broad range of downstream applications through fine-tuning. On generalization tasks, however, current fine-tuning methods for CLIP, such as CoOp and CoCoOp, perform relatively poorly on some fine-grained datasets. We attribute this to the fact that these methods project only global features into the prompt and neglect visual concepts such as colors, shapes, and sizes, which transfer naturally across domains and play a crucial role in generalization. To address this issue, we propose Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge of CLIP to create a visual concept cache that enables concept-guided prompting. To refine the text features, we further develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided prompt learning approach achieves enhanced consistency between the visual and linguistic modalities. Extensive experimental results demonstrate that CPL significantly improves generalization compared to current state-of-the-art methods.
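The abstract does not say how the visual concept cache is built, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes each concept has a text feature, that image features live in the same embedding space, and that the cache stores the top-k image features most similar to each concept. Random unit vectors stand in for CLIP features, and the names `build_concept_cache` and `concept_scores` are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_concept_cache(concept_text_feats, image_feats, top_k):
    """Assumed cache rule: for each concept, keep the top_k image features
    that are most similar to that concept's text feature."""
    cache = {}
    for name, text_feat in concept_text_feats.items():
        ranked = sorted(image_feats, key=lambda f: cosine(text_feat, f), reverse=True)
        cache[name] = ranked[:top_k]
    return cache


def concept_scores(query_feat, cache):
    """Score a query feature per concept by its best match among cached keys."""
    return {
        name: max(cosine(query_feat, key) for key in keys)
        for name, keys in cache.items()
    }


# Stand-in data: random unit vectors instead of real CLIP text/image features.
random.seed(0)
dim = 8
concepts = {
    name: normalize([random.gauss(0, 1) for _ in range(dim)])
    for name in ("red", "round", "small")
}
images = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(20)]

cache = build_concept_cache(concepts, images, top_k=3)
scores = concept_scores(images[0], cache)
```

In this sketch the per-concept scores could serve as the concept signal that guides a prompt; how CPL actually uses the cache, and how its projector maps multi-level visual features into text features, is described in the paper itself rather than in the abstract.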

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Machine Learning > Application Areas > Domain Generalization
Machine Learning > Learning Paradigms > Transfer Learning
Deep Learning > Learning Types > Few-Shot Learning
Deep Learning > Models > Vision-Language Models

Keywords

few-shot learning, domain generalization, transfer learning, prompt learning, concept learning, vision-language model, concept-guided prompting
