Deep Modular Co-Attention Networks for Visual Question Answering

Zhou Yu; Jun Yu; Yuhao Cui; Dacheng Tao; Qi Tian

CVPR 2019

Abstract

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective "co-attention" model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state of the art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
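
The modular composition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and it leaves out any feed-forward, normalization, or residual components a full model would need. All function names (`self_attention`, `guided_attention`, `mca_layer`, `mcan_sketch`) and the feature sizes are hypothetical, chosen only to show how a self-attention unit and a guided-attention unit might be combined into one layer and then cascaded in depth.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Scaled dot-product attention over 2-D arrays of shape (items, dim).
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)
    return softmax(scores) @ values


def self_attention(x):
    # Self-attention unit: a feature set attends to itself.
    return attention(x, x, x)


def guided_attention(x, y):
    # Guided-attention unit: features x are re-weighted using features y.
    return attention(x, y, y)


def mca_layer(image_feats, question_feats):
    # One possible composition (an assumption): question self-attention,
    # image self-attention, then question-guided attention of the image.
    question_out = self_attention(question_feats)
    image_out = guided_attention(self_attention(image_feats), question_out)
    return image_out, question_out


def mcan_sketch(image_feats, question_feats, num_layers=2):
    # Cascade the layer in depth; num_layers=2 is an arbitrary choice.
    for _ in range(num_layers):
        image_feats, question_feats = mca_layer(image_feats, question_feats)
    return image_feats, question_feats


# Toy inputs: 5 image regions and 3 question words, each of dimension 8.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(5, 8))
question_feats = rng.normal(size=(3, 8))
image_out, question_out = mcan_sketch(image_feats, question_feats)
```

In this sketch the attended image features keep one row per image region and the question features keep one row per word, so layers can be stacked without changing shapes.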

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Applications > Question Answering
Computer Vision > Analysis > Video Understanding
Natural Language Processing > Applications > Visual Question Answering
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Attention Mechanism
Computer Vision > Analysis > Visual Question Answering

Keywords

visual question answering, multimodal learning, deep learning, image-text alignment, co-attention mechanism, modular attention network, modular attention
