WACV 2025

Relaxing Binary Constraints in Contrastive Vision-Language Medical Representation Learning

Abstract

By aligning the embeddings of paired images and captions, contrastive vision-language representation learning has seen significant advances, as illustrated by CLIP, allowing visual encoders to learn from textual supervision and vice versa. Benefiting from millions of image-caption pairs collected from the Internet, CLIP-like models show competitive performance against fully supervised baselines. However, the learned visual representations are still undermined by a binary constraint: most contrastive learning frameworks assume a strict one-to-one correspondence between the input pairs and optimize the models using the InfoNCE loss function. The embeddings of paired images and texts are aligned, while those of unpaired images and texts are pushed away from each other. In fact, there are naturally many "false negatives" among these negative pairs, since unpaired data can also have a high similarity. In this work, we aim to overcome the impact of false negatives in vision-language representation learning by introducing soft targets that estimate the similarity between unpaired images and texts using external semantic knowledge structured in the form of graphs. The value of this method is demonstrated in the application context of medical imaging.
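The contrast between binary and soft targets can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the soft targets are derived from the knowledge graph, so the matrix `prior_sim`, the mixing weight `alpha`, and the function names below are all hypothetical. The sketch assumes that `prior_sim[i][j]` is a non-negative semantic similarity between sample `i` and sample `j` obtained from some external source, and that soft targets are a convex mix of the one-hot pairing and the row-normalized prior.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _cross_entropy(logits, targets):
    """Mean cross-entropy between target distributions and softmax(logits)."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        log_probs = _log_softmax(logit_row)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(logits)


def _transpose(m):
    return [list(col) for col in zip(*m)]


def soft_targets(prior_sim, alpha):
    """Mix one-hot targets with a row-normalized prior similarity.

    Assumption (not from the paper): prior_sim rows are non-negative with a
    positive sum. alpha = 0 gives the usual binary (one-hot) targets.
    """
    targets = []
    for i, row in enumerate(prior_sim):
        total = sum(row)
        targets.append([
            (1.0 - alpha) * (1.0 if i == j else 0.0) + alpha * s / total
            for j, s in enumerate(row)
        ])
    return targets


def relaxed_contrastive_loss(img_emb, txt_emb, prior_sim,
                             alpha=0.5, temperature=0.07):
    """Symmetric CLIP-style contrastive loss with soft targets."""
    img = [_normalize(v) for v in img_emb]
    txt = [_normalize(v) for v in txt_emb]
    # Cosine similarity between every image and every caption in the batch.
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
              for u in img]
    loss_i2t = _cross_entropy(logits, soft_targets(prior_sim, alpha))
    loss_t2i = _cross_entropy(_transpose(logits),
                              soft_targets(_transpose(prior_sim), alpha))
    return 0.5 * (loss_i2t + loss_t2i)
```

With `alpha=0` the function reduces to the standard symmetric InfoNCE objective, in which every unpaired image-text combination is treated as a negative. With `alpha>0`, unpaired samples that the prior marks as similar receive part of the target mass, so they are no longer pushed apart as strongly.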
