Re-Attention for Visual Question Answering

Wenya Guo; Ying Zhang; Xiaoping Wu; Jufeng Yang; Xiangrui Cai; Xiaojie Yuan

AAAI 2020

Abstract

Visual Question Answering (VQA) requires a simultaneous understanding of images and questions. Existing methods perform well by focusing on both key objects in images and key words in questions. However, the answer also contains rich information that can help to better describe the image and generate more accurate attention maps. In this paper, to utilize the information in the answer, we propose a re-attention framework for the VQA task. We first associate the image and the question by calculating the similarity of each object-word pair in the feature space. Then, based on the answer, the learned model re-attends to the corresponding visual objects in the image and reconstructs the initial attention map to produce consistent results. Benefiting from the re-attention procedure, the question can be better understood and a satisfactory answer generated. Extensive experiments on the benchmark dataset demonstrate that the proposed method performs favorably against state-of-the-art approaches.
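The object-word association step can be illustrated with a small sketch. The abstract does not specify the similarity measure or how scores are pooled, so everything here is an assumption made for illustration: cosine similarity between feature vectors, max-pooling over words, and a softmax over objects. The function names (`object_word_similarity`, `attention_over_objects`) and the toy features are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def object_word_similarity(objects, words):
    """Similarity matrix with one row per object and one column per word."""
    return [[cosine(o, w) for w in words] for o in objects]

def attention_over_objects(objects, words):
    """Assumed pooling: score each object by its best-matching word,
    then normalize the scores into an attention map over objects."""
    sim = object_word_similarity(objects, words)
    scores = [max(row) for row in sim]
    return softmax(scores)

# Toy features: three objects and one question word in a shared 2-D space.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0]]
attn = attention_over_objects(objects, words)
```

In this toy setup the first object, whose feature is aligned with the word, receives the largest attention weight, and the orthogonal second object receives the smallest. A re-attention step as described in the abstract would compute a second map from the answer and compare it with this initial one; that part is not sketched here because the abstract gives no detail on how it is done.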

Topics

Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Attention
Computer Vision > Generation > Visual Question Answering

Keywords

visual question answering, attention mechanism, multimodal learning, image understanding, attention map, image-text similarity
