Context-Aware Attention Network for Image-Text Retrieval

Qi Zhang; Zhen Lei; Zhaoxiang Zhang; Stan Z. Li

CVPR 2020

Abstract

As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on joint embedding learning and on the similarity measure for each image-text pair. It remains challenging because prior work seldom explores semantic correspondences between modalities and semantic correlations within a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences in the retrieval process, intra-modal correlations are derived from the second-order attention of region-word alignments instead of intuitively comparing the distance between original features. Our method achieves fairly competitive results on two generic image-text retrieval datasets, Flickr30K and MS-COCO.
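The abstract gives no formulas, so the following is only a minimal NumPy sketch of the general idea under stated assumptions, not the authors' CAAN implementation. It assumes region and word features already live in a shared d-dimensional space, builds a cosine region-word alignment matrix H, scores each fragment by its mean alignment with the other modality (inter-modal) plus its mean second-order correlation H Hᵀ within its own modality (intra-modal, derived through the alignments rather than from raw feature distances), and pools fragments with a softmax over those scores. The helper names (`context_attention`, `caan_style_similarity`), the mean pooling, the additive score combination, and the temperature of 0.1 are all hypothetical choices; the real network also learns parameters that this sketch omits.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    z = x / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def context_attention(frags: np.ndarray, align: np.ndarray,
                      temperature: float = 0.1) -> np.ndarray:
    """Pool one modality's fragments with context-aware weights (hypothetical scoring).

    frags: (m, d) unit-norm fragment features of this modality.
    align: (m, p) alignment scores against the p fragments of the other modality.
    """
    inter = align.mean(axis=1)          # inter-modal: average alignment with the other modality
    second_order = align @ align.T      # (m, m) intra-modal correlation derived via the alignments
    intra = second_order.mean(axis=1)   # average correlation with the fragment's own modality
    weights = softmax(inter + intra, temperature)  # assumed additive combination
    return l2_normalize(weights @ frags)           # (d,) attended embedding


def caan_style_similarity(regions: np.ndarray, words: np.ndarray) -> float:
    """Cosine similarity between attended image and sentence embeddings."""
    v = l2_normalize(regions)           # (k, d) region features
    u = l2_normalize(words)             # (n, d) word features
    h = v @ u.T                         # (k, n) region-word alignment matrix
    img = context_attention(v, h)       # regions weighted using word context
    txt = context_attention(u, h.T)     # words weighted using region context
    return float(img @ txt)


# Usage with random stand-in features (sizes are arbitrary, not from the paper).
rng = np.random.default_rng(0)
images = [rng.normal(size=(36, 64)) for _ in range(3)]
captions = [rng.normal(size=(12, 64)) for _ in range(3)]
S = np.array([[caan_style_similarity(im, cap) for cap in captions] for im in images])
# Image-to-text retrieval ranks each row of S; text-to-image ranks each column.
print(S.argmax(axis=1), S.argmax(axis=0))
```

Because both directions are read off the same pairwise similarity matrix, the sketch covers bi-directional retrieval; a trained model would replace the random features with learned embeddings.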

Topics

Machine Learning > Core Methods > Metric Learning
Machine Learning > Core Methods > Embedding Learning
Natural Language Processing > Applications > Information Retrieval
Computer Vision > Core AI > Multimodal Learning
Computer Vision > Analysis > Image Retrieval
Deep Learning > Techniques > Attention Mechanism

Keywords

metric learning, attention mechanism, cross-modal retrieval, semantic alignment, image-text retrieval, cross-modal embedding, semantic correspondence, similarity measure, image-text matching, joint embedding, attention network, bi-directional retrieval

Related papers

Deep Polarization Cues for Transparent Object Segmentation 2020

HRank: Filter Pruning Using High-Rank Feature Map 2020

Panoptic-Based Image Synthesis 2020

Select, Supplement and Focus for RGB-D Saliency Detection 2020

ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings 2020