SCOT: Self-Supervised Contrastive Pretraining for Zero-Shot Compositional Retrieval

Bhavin Jawade; João V. B. Soares; Kapil Thadani; Deen Dayal Mohan; Amir Erfan Eshratifar; Benjamin Culpepper; Paloma de Juan; Srirangaraj Setlur; Venu Govindaraju

WACV 2025

Abstract

Compositional image retrieval (CIR) is a multimodal learning task in which a model combines a query image with a user-provided text modification to retrieve a target image. CIR has applications in a variety of domains, including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor-intensive; and (ii) the resulting models generalize poorly to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be used as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses state-of-the-art (SOTA) zero-shot compositional retrieval methods, as well as many fully-supervised methods, on standard benchmarks such as FashionIQ and CIRR.
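
The proxy-target idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss, the composition network, or any hyperparameters. The sketch assumes an InfoNCE-style loss with in-batch negatives, a placeholder composition function (a normalized element-wise sum, named `compose` here purely for illustration), a temperature of 0.07, and hand-written toy vectors in place of embeddings from a vision-language model. The point it shows is only that the contrastive target is a text embedding of the modified caption rather than a target image embedding.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compose(image_emb, modifier_emb):
    """Hypothetical stand-in for the embedding composition network.

    Assumption: a normalized element-wise sum. The real network is learned.
    """
    return l2_normalize([a + b for a, b in zip(image_emb, modifier_emb)])


def info_nce(queries, targets, temperature=0.07):
    """InfoNCE-style loss with in-batch negatives (assumed, not from the source).

    queries[i] should match targets[i]; every other target in the batch
    acts as a negative for queries[i].
    """
    queries = [l2_normalize(q) for q in queries]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


# Toy batch of two samples (all vectors are made up for illustration).
reference_image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
modifier_text_embs = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# Proxy targets: text embeddings of the *modified* captions, used in place
# of target image embeddings during pretraining.
proxy_targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

composed = [compose(i, t) for i, t in zip(reference_image_embs, modifier_text_embs)]
loss = info_nce(composed, proxy_targets)
print(f"contrastive loss against proxy text targets: {loss:.6f}")
```

In this toy setup the loss is small when each composed embedding is paired with its own proxy target, and grows when the targets are mismatched, which is the signal a composition network would be trained on.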

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

vision-language model; contrastive pretraining; self-supervised contrastive learning; compositional image retrieval; zero-shot compositional retrieval; embedding composition
