IJCAI 2019

Dual Visual Attention Network for Visual Dialog

Abstract

Visual dialog is a challenging task that involves multi-round semantic transformations between vision and language. This paper addresses cross-modal semantic correlation for visual dialog. Motivated by the observation that Vg (global vision), Vl (local vision), Q (question), and H (history) are closely interrelated, the paper proposes a novel Dual Visual Attention Network (DVAN) to realize the mapping (Vg, Vl, Q, H) → A, where A is the answer. DVAN is a three-stage query-adaptive attention model. To obtain an accurate answer, it first applies textual attention, which imposes the question on the history to pick out the related context H'. Then, based on Q and H', it applies separate visual attentions to discover related global image hints Vg' and local object-based hints Vl'. Next, a dual crossing visual attention is proposed, in which Vg' and Vl' are mutually embedded to learn complementary visual semantics. Finally, the attended textual and visual features are combined to infer the answer. Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach.
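
The three attention stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring function, how Q and H' are combined, how the crossing step is computed, or how features are fused. The sketch assumes dot-product scoring, element-wise addition to form queries, and concatenation for fusion. The names `attend` and `dvan_sketch` are hypothetical, and all features are plain lists of floats with a shared dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def attend(query, items):
    """Soft attention: weight each item by softmax(query . item).

    Returns the weighted sum of the items and the attention weights.
    Dot-product scoring is an assumption of this sketch.
    """
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    attended = [sum(w * item[i] for w, item in zip(weights, items))
                for i in range(dim)]
    return attended, weights


def dvan_sketch(q, history, v_global, v_local):
    """Hypothetical three-stage pipeline for (Vg, Vl, Q, H) -> fused feature."""
    # Stage 1: textual attention, the question attends over history rounds -> H'.
    h_att, _ = attend(q, history)
    # Query for the visual stages combines Q and H' (addition is an assumption).
    query = add(q, h_att)
    # Stage 2: separate visual attentions over global and local features.
    vg_att, _ = attend(query, v_global)   # Vg'
    vl_att, _ = attend(query, v_local)    # Vl'
    # Stage 3: crossing step, each attended stream guides attention over the
    # other stream (one possible reading of "mutually embedded").
    vg_cross, _ = attend(add(query, vl_att), v_global)
    vl_cross, _ = attend(add(query, vg_att), v_local)
    # Fusion by concatenation (assumption); an answer decoder would follow.
    return q + h_att + vg_cross + vl_cross
```

In this sketch the fused vector would be passed to an answer decoder, which the abstract mentions only as the final inference step and which is therefore omitted here.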

Authors