Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding

Tal Shaharabany; Lior Wolf

CVPR 2023

Abstract

A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map. We present an effective way to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder. Our main insights are that these maps resemble localization maps and that, by combining such maps, one can obtain useful pseudo-labels for self-training. Our results surpass, by a large margin, the state of the art in weakly supervised phrase grounding. A similar gap in performance is obtained for a recently proposed downstream task called WWbL, in which the input image is given without any text. Our code is available as supplementary material.
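The idea of reading self-similarity maps off an encoder's latent features can be illustrated with a small NumPy sketch. The abstract does not specify the similarity measure or how maps are combined, so everything below is an assumption made for illustration: cosine similarity is used as the measure, and `combine_maps` (a hypothetical helper, including its `top_k` parameter) averages the maps anchored at the strongest locations of a coarse localization map. It should not be read as the authors' method.

```python
import numpy as np


def self_similarity_maps(features):
    """Return one cosine self-similarity map per spatial location.

    features: array of shape (C, H, W), e.g. a latent feature map.
    Result:   array of shape (H*W, H, W); entry i is the similarity of
              location i to every location in the grid.
    Assumption: cosine similarity; the abstract does not name the measure.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w).T                  # (H*W, C)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-8)                # guard against zero vectors
    sim = unit @ unit.T                                  # (H*W, H*W)
    return sim.reshape(h * w, h, w)


def combine_maps(sim_maps, coarse_map, top_k=5):
    """Hypothetical combination rule (not taken from the paper).

    Averages the self-similarity maps anchored at the top_k highest-scoring
    locations of a coarse localization map, then min-max normalizes the
    result to [0, 1] so it can serve as a soft pseudo-label.
    """
    h, w = coarse_map.shape
    k = min(top_k, h * w)
    anchors = np.argsort(coarse_map.ravel())[-k:]        # indices of the k largest scores
    pseudo = sim_maps[anchors].mean(axis=0)
    lo, hi = pseudo.min(), pseudo.max()
    if hi > lo:
        return (pseudo - lo) / (hi - lo)
    return np.zeros_like(pseudo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 4, 5))    # stand-in for encoder features
    coarse = rng.random((4, 5))           # stand-in for a model's localization map
    maps = self_similarity_maps(feats)
    pseudo_label = combine_maps(maps, coarse, top_k=3)
    print(maps.shape, pseudo_label.shape)
```

In a self-training loop, a map like `pseudo_label` would play the role of a training target for the grounding model; how such targets are actually built and used is described in the paper itself, not in this sketch.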

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Natural Language Processing > Applications > Information Extraction
Deep Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Paradigms > Weakly Supervised Learning
Computer Vision > Analysis > Visual Question Answering

Keywords

weakly supervised learning; phrase grounding; image encoder; similarity map
