Multi-Modality Cross Attention Network for Image and Sentence Matching

Xi Wei; Tianzhu Zhang; Yan Li; Yongdong Zhang; Feng Wu

CVPR 2020

Abstract

The key to image and sentence matching is accurately measuring the visual-semantic similarity between an image and a sentence. However, most existing methods use only one of two cues for this cross-modal matching task: the intra-modality relationship within each modality, or the inter-modality relationship between image regions and sentence words. In contrast, we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism that exploits both the intra-modality relationship within each modality and the inter-modality relationship between image regions and sentence words, so that the two complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
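The idea of modeling both kinds of relationship in one attention step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, for illustration only, that region and word features already share one embedding dimension, that attention is plain scaled dot-product attention with no learned projections, and that the matching score is a cosine similarity of mean-pooled outputs. The names joint_attention and cosine_similarity, and the random features, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(regions, words):
    """Scaled dot-product attention over stacked region and word features.

    regions: (R, d) array, words: (W, d) array (assumed same d).
    Returns the attended features (R + W, d) and the weights (R + W, R + W).
    Assumption: no learned query/key/value projections are used here.
    """
    tokens = np.concatenate([regions, words], axis=0)
    d = tokens.shape[1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)
    return weights @ tokens, weights

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical inputs: 4 image regions and 6 sentence words, 8-dim features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
words = rng.normal(size=(6, 8))

attended, weights = joint_attention(regions, words)
R = regions.shape[0]

# One weight matrix holds both relationship types.
intra_image = weights[:R, :R]       # region-to-region (intra-modality)
inter_image_text = weights[:R, R:]  # region-to-word (inter-modality)

# Assumed pooling and scoring: mean-pool each modality, then cosine similarity.
image_vec = attended[:R].mean(axis=0)
sentence_vec = attended[R:].mean(axis=0)
score = cosine_similarity(image_vec, sentence_vec)
```

Because the regions and words are stacked before attention, each output row mixes information from its own modality and from the other one, which is the sense in which the two relationships are modeled jointly in this sketch.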

Topics

Machine Learning > Core Methods > Metric Learning
Natural Language Processing > Applications > Information Retrieval
Computer Vision > Core AI > Multimodal Learning
Machine Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Language
Artificial Intelligence > Core AI > Multi-Modal Learning
Computer Vision > Analysis > Computer Vision
Artificial Intelligence > Core AI > Attention

Keywords

attention mechanism; multimodal learning; multi-modality learning; cross-modal retrieval; cross-attention mechanism; image-text matching; visual-semantic similarity; image-sentence matching; intra-modality relationship; inter-modality relationship
