Discrete-Continuous Action Space Policy Gradient-Based Attention for Image-Text Matching

Shiyang Yan; Li Yu; Yuan Xie

CVPR 2021

Abstract

Image-text matching is an important multi-modal task with many applications. It aims to match images and text that carry similar semantic information. Existing approaches do not explicitly transform the different modalities into a common space. Meanwhile, the attention mechanism widely used in image-text matching models receives no supervision. We propose a novel attention scheme that projects the image and text embeddings into a common space and optimises the attention weights directly towards the evaluation metrics. The proposed scheme can be regarded as a form of supervised attention that requires no additional annotations. It is trained via a novel discrete-continuous action space policy gradient algorithm, which models complex action spaces more effectively than previous continuous action space policy gradient methods. We evaluate the proposed method on two widely used benchmark datasets, Flickr30k and MS-COCO, where it outperforms previous approaches by a large margin.
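To make the idea of a discrete-continuous action space concrete, the following is a minimal sketch of a score-function (REINFORCE) gradient estimate for a hybrid action. It is not the authors' algorithm: the abstract gives no implementation details, so everything here is an assumption made for illustration. In particular, the per-region Bernoulli gate, the Gaussian attention weight, the independence of the two parts, and all function names are hypothetical, and the scalar `reward` merely stands in for an evaluation metric.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hybrid_action(logits, means, sigma, rng):
    """Sample one discrete gate (0 or 1) and one continuous weight per region.

    Assumption: gates are Bernoulli(sigmoid(logit)) and weights are
    Normal(mean, sigma), sampled independently of each other.
    """
    gates = [1 if rng.random() < sigmoid(l) else 0 for l in logits]
    weights = [rng.gauss(m, sigma) for m in means]
    return gates, weights


def hybrid_log_prob(logits, means, sigma, gates, weights):
    """Joint log-probability of a hybrid action under the assumed policy."""
    lp = 0.0
    for l, m, g, w in zip(logits, means, gates, weights):
        p = sigmoid(l)
        # Discrete part: Bernoulli log-likelihood of the gate.
        lp += math.log(p) if g == 1 else math.log(1.0 - p)
        # Continuous part: Gaussian log-density of the weight.
        lp += (-0.5 * ((w - m) / sigma) ** 2
               - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    return lp


def reinforce_gradients(logits, means, sigma, gates, weights,
                        reward, baseline=0.0):
    """Single-sample REINFORCE estimate: (reward - baseline) * grad log pi."""
    adv = reward - baseline
    # d/d logit of the Bernoulli log-likelihood is (gate - sigmoid(logit)).
    grad_logits = [adv * (g - sigmoid(l)) for l, g in zip(logits, gates)]
    # d/d mean of the Gaussian log-density is (weight - mean) / sigma^2.
    grad_means = [adv * (w - m) / sigma ** 2 for m, w in zip(means, weights)]
    return grad_logits, grad_means


if __name__ == "__main__":
    rng = random.Random(0)
    logits, means, sigma = [0.0, 0.5, -0.5], [0.2, 0.1, 0.0], 0.3
    gates, weights = sample_hybrid_action(logits, means, sigma, rng)
    # The reward is a placeholder; a real system would compute a metric here.
    grads = reinforce_gradients(logits, means, sigma, gates, weights,
                                reward=1.0, baseline=0.5)
    print(gates, weights, grads)
```

Because the reward enters only as a scalar multiplier on the log-probability gradient, it does not need to be differentiable, which is what allows a policy gradient method to optimise attention weights directly towards an evaluation metric.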

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Reinforcement Learning > Methods > Policy Learning
Machine Learning > Learning Types > Reinforcement Learning
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Attention

Keywords

policy gradient; attention mechanism; multi-modal learning; cross-modal retrieval; image-text matching; discrete-continuous action space

Related papers

Learning To Reconstruct High Speed and High Dynamic Range Videos From Events 2021

DeFLOCNet: Deep Image Editing via Flexible Low-Level Controls 2021

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs 2021

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization 2021

Pose-Guided Human Animation From a Single Image in the Wild 2021