Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang; Rabiul Awal; Aishwarya Agrawal

2024 CVPR CVPR 2024

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Abstract

Vision-Language Models (VLMs) such as CLIP exhibit strong image-text comprehension abilities facilitating advances in several downstream tasks such as zero-shot image classification image-text retrieval and text-to-image generation. However the compositional reasoning abilities of existing VLMs remains subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally the current contrastive learning objective fails to focus on fine-grained grounding components like relations actions and attributes resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐣 Hot Topic Early Bird — hard negative mining

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Le Zhang , Rabiul Awal , Aishwarya Agrawal

Topics

Machine Learning > Core Methods > Representation Learning Machine Learning > Learning Types > Contrastive Learning Deep Learning > Architectures > Transformers Computer Vision > Core AI > Multimodal Learning Deep Learning > Learning Types > Contrastive Learning Artificial Intelligence > Core AI > Multi-Modal Learning Deep Learning > Models > Vision-Language Models Artificial Intelligence > Core AI > Vision-Language Models

Keywords

contrastive learning vision language model vision-language model image-text retrieval image-text alignment zero-shot classification compositional reasoning hard negative mining hard negative

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024