Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Fengyin Lin; Mingkang Li; Da Li; Timothy Hospedales; Yi-Zhe Song; Yonggang Qi

CVPR 2023

Abstract

This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), with two significant differentiators from prior art: (i) we tackle all variants (inter-category, intra-category, and cross-dataset) of ZS-SBIR with just one network ("everything"), and (ii) we want to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. With this change alone, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across the two modalities, and (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show that ours indeed delivers superior performance across all ZS-SBIR settings. The all-important explainability goal is achieved by visualizing cross-modal token correspondences and, for the first time, via sketch-to-photo synthesis by universal replacement of all matched photo patches.
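The three-stage pipeline described in the abstract (select informative tokens, match them across modalities, aggregate the matches into one score) can be illustrated with a small, purely hypothetical sketch. Nothing below is taken from the paper's implementation: the function names, the top-k selection standing in for the learnable tokenizer, the single-head dot-product attention, and the fixed RBF kernel standing in for the learned kernel-based relation network are all assumptions made for illustration, and the toy 2-D "tokens" are invented.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.
    Assumption: a stand-in for a learnable tokenizer, not the paper's module."""
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

def cross_attend(queries, keys):
    """Scaled dot-product attention from sketch tokens (queries) to photo
    tokens (keys, which also serve as values in this simplified sketch)."""
    scale = math.sqrt(len(keys[0]))
    weights, attended = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        weights.append(w)
        attended.append(
            [sum(wj * k[d] for wj, k in zip(w, keys)) for d in range(len(q))]
        )
    return weights, attended

def rbf_pair_score(sketch_tokens, matched_tokens, gamma=1.0):
    """Mean RBF kernel value over (sketch token, attended photo token) pairs.
    Assumption: a fixed kernel replacing the learned relation network."""
    values = []
    for s, m in zip(sketch_tokens, matched_tokens):
        sq_dist = sum((a - b) ** 2 for a, b in zip(s, m))
        values.append(math.exp(-gamma * sq_dist))
    return sum(values) / len(values)

def similarity(sketch_tokens, photo_tokens):
    weights, attended = cross_attend(sketch_tokens, photo_tokens)
    return rbf_pair_score(sketch_tokens, attended), weights

# Toy example: photo_a shares the sketch's local patterns, photo_b does not.
sketch = [[1.0, 0.0], [0.0, 1.0]]
photo_a = [[1.0, 0.0], [0.0, 1.0]]
photo_b = [[-1.0, 0.0], [0.0, -1.0]]
score_a, weights_a = similarity(sketch, photo_a)
score_b, _ = similarity(sketch, photo_b)
kept = select_top_k(["a", "b", "c"], [0.1, 0.9, 0.5], 2)
```

In this toy setup the attention weights play the role of the token correspondences that the abstract proposes to visualize: each row of `weights_a` indicates which photo token a given sketch token attends to most strongly, and `score_a` comes out higher than `score_b` because the matched tokens lie closer to the sketch tokens.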

Authors

Fengyin Lin, Mingkang Li, Da Li, Timothy Hospedales, Yi-Zhe Song, Yonggang Qi

Topics

Artificial Intelligence > Core AI > Interpretability
Machine Learning > Learning Types > Zero-Shot Learning
Computer Vision > Core AI > Multimodal Learning
Machine Learning > Learning Paradigms > Zero-Shot Learning
Deep Learning > Learning Types > Zero-Shot Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Computer Vision > Generation > Image Retrieval

Keywords

zero-shot learning, explainable AI, visual token, cross-modal matching, sketch-based image retrieval, local patch correspondence, kernel-based relation network
