Relaxing Binary Constraints in Contrastive Vision-Language Medical Representation Learning
Abstract
Contrastive vision-language representation learning, exemplified by CLIP, has advanced significantly by aligning the embeddings of paired images and captions, allowing visual encoders to learn from textual supervision and vice versa. Benefiting from millions of image-caption pairs collected from the Internet, CLIP-like models perform competitively against fully supervised baselines. However, the learned visual representations are still undermined by a binary constraint: most contrastive learning frameworks assume a strict one-to-one correspondence between the input pairs and optimize the model with the InfoNCE loss, which aligns the embeddings of paired images and texts while pushing those of unpaired images and texts apart. In practice, these negative pairs naturally contain many "false negatives", since unpaired images and texts can also be highly similar. In this work, we aim to mitigate the impact of false negatives in vision-language representation learning by introducing soft targets that estimate the similarity between unpaired images and texts using external semantic knowledge structured in the form of graphs. We demonstrate the value of this approach in the context of medical imaging.
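To make the contrast between binary and soft targets concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes that the graph-derived knowledge has already been reduced to a pairwise similarity matrix `semantic_sim`; the function names, the mixing weight `alpha`, and the temperature values are hypothetical choices made for illustration only.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def cross_entropy(logits, targets):
    # Cross-entropy between row-wise target distributions and softmax(logits).
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()


def soft_targets(semantic_sim, alpha=0.5, temperature=0.1):
    # Assumption: semantic_sim[i, j] scores how related sample i is to sample j
    # (e.g. derived from a knowledge graph). Each row is turned into a
    # distribution and mixed with the one-hot (binary) target.
    n = semantic_sim.shape[0]
    soft = np.exp(log_softmax(semantic_sim / temperature, axis=1))
    return (1.0 - alpha) * np.eye(n) + alpha * soft


def contrastive_loss(img_emb, txt_emb, semantic_sim=None, alpha=0.5, tau=0.07):
    # Cosine-similarity logits between every image and every text in the batch.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    n = logits.shape[0]
    if semantic_sim is None:
        # Binary constraint: only the paired caption counts as a positive.
        t_i2t = t_t2i = np.eye(n)
    else:
        # Relaxed constraint: related unpaired samples receive some target mass.
        t_i2t = soft_targets(semantic_sim, alpha)
        t_t2i = soft_targets(semantic_sim.T, alpha)
    # Symmetric loss over the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, t_i2t) + cross_entropy(logits.T, t_t2i))


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(4, 8))
sim = np.eye(4)
sim[0, 1] = sim[1, 0] = 0.9  # samples 0 and 1 are semantically close
hard_loss = contrastive_loss(img, txt)
soft_loss = contrastive_loss(img, txt, semantic_sim=sim, alpha=0.5)
```

With `alpha=0` the sketch reduces to the standard InfoNCE objective with one-hot targets; larger values of `alpha` shift more target mass toward unpaired samples that the similarity matrix marks as related.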