BMIP: Bi-directional Modality Interaction Prompt Learning for VLM

Song-Lin Lv; Yu-Yang Chen; Zhi Zhou; Ming Yang; Lan-Zhe Guo

2025 IJCAI IJCAI 2025

BMIP: Bi-directional Modality Interaction Prompt Learning for VLM

Abstract

Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Song-Lin Lv , Yu-Yang Chen , Zhi Zhou , Ming Yang , Lan-Zhe Guo

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Learning Types > Transfer Learning Computer Vision > Core AI > Multimodal Learning Deep Learning > Learning Types > Multi-Modal Learning

Keywords

domain generalization transfer learning attention mechanism multi-modal learning prompt learning vision language model cross-modal alignment foundation model vision-language model modality interaction

Download PDF

Related papers

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain 2025

Responsibility Anticipation and Attribution in LTLf 2025

Argument-based Multi-Issue Negotiation 2025

Online Resource Sharing: Better Robust Guarantees via Randomized Strategies 2025

Equitable Mechanism Design for Facility Location 2025