Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

Wei Li; Zhen Huang; Xinmei Tian; Le Lu; Houqiang Li; Xu Shen; Jieping Ye

2024 EMNLP EMNLP 2024

Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

Abstract

AbstractContrastively trained vision-language models such as CLIP have achieved remarkable progress in vision and language representation learning. Despite the promising progress, their proficiency in compositional reasoning over attributes and relations (e.g., distinguishing between “the car is underneath the person” and “the person is underneath the car”) remains notably inadequate. We investigate the cause for this deficient behavior is the composition attribution issue, where the attribution scores (e.g., attention scores or GradCAM scores) for relations (e.g., underneath) or attributes (e.g., red) in the text are substantially lower than those for object terms. In this work, we show such issue is mitigated via a novel framework called CAE (Composition Attribution Enhancement). This generic framework incorporates various interpretable attribution methods to encourage the model to pay greater attention to composition words denoting relationships and attributes within the text. Detailed analysis shows that our approach enables the models to adjust and rectify the attribution of the texts. Extensive experiments across seven benchmarks reveal that our framework significantly enhances the ability to discern intricate details and construct more sophisticated interpretations of combined visual and linguistic elements.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — interpretable attribution

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Wei Li , Zhen Huang , Xinmei Tian , Le Lu , Houqiang Li , Xu Shen , Jieping Ye

Topics

Artificial Intelligence > Core AI > Interpretability Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Contrastive Learning Computer Vision > Core AI > Multimodal Learning Deep Learning > Techniques > Attention

Keywords

contrastive learning attention mechanism vision-language model compositional reasoning attribution method attention score interpretable attribution

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024