Graph-Structured Representations for Visual Question Answering

Damien Teney; Lingqiao Liu; Anton van den Hengel

CVPR 2017

Abstract

This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions. A key challenge in VQA is the need for joint reasoning over the visual and text domains. The predominant CNN/LSTM-based approach to VQA is limited by monolithic vector representations that largely ignore structure in the scene and in the question. CNN feature vectors cannot effectively capture situations as simple as multiple object instances, and LSTMs process questions as sequences of words, which does not reflect the true complexity of language structure. We instead propose to build graphs over the scene objects and over the question words, and we describe a deep neural network that exploits the structure in these representations. We show that this approach achieves significant improvements over the state of the art, increasing accuracy from 71.2% to 74.4% on the "abstract scenes" multiple-choice benchmark, and from 34.7% to 39.1% over pairs of "balanced" scenes, i.e. images with fine-grained differences and opposite yes/no answers to the same question.
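As a rough illustration of what "graphs over the scene objects and over the question words" could look like as data, the sketch below builds both graphs in plain Python. It is not the authors' implementation: the `Graph` class, the helper names, the fully connected scene graph with relative-offset edge features, and the hand-written dependency labels are all assumptions made for the example. It does show the point made in the abstract that two instances of the same object remain distinct nodes instead of being merged into one feature vector.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Graph:
    """Nodes carry a feature dict; edges are keyed by (source, target) index pairs."""
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)

    def add_node(self, **features):
        self.nodes.append(features)
        return len(self.nodes) - 1

    def add_edge(self, i, j, **features):
        self.edges[(i, j)] = features


def build_scene_graph(objects):
    """Fully connect scene objects; edge features are relative offsets (an assumption)."""
    g = Graph()
    for obj in objects:
        g.add_node(category=obj["category"], x=obj["x"], y=obj["y"])
    for i, j in combinations(range(len(g.nodes)), 2):
        dx = g.nodes[j]["x"] - g.nodes[i]["x"]
        dy = g.nodes[j]["y"] - g.nodes[i]["y"]
        g.add_edge(i, j, dx=dx, dy=dy)
        g.add_edge(j, i, dx=-dx, dy=-dy)
    return g


def build_question_graph(words, dependencies):
    """One node per word; edges come from (head, dependent, label) triples."""
    g = Graph()
    for w in words:
        g.add_node(word=w)
    for head, dep, label in dependencies:
        g.add_edge(head, dep, label=label)
    return g


# Hypothetical scene with two instances of the same object category.
scene = build_scene_graph([
    {"category": "dog", "x": 10, "y": 40},
    {"category": "dog", "x": 70, "y": 42},
    {"category": "ball", "x": 40, "y": 45},
])

# Hand-written, illustrative dependency edges (not produced by a real parser).
question = build_question_graph(
    ["is", "the", "ball", "between", "the", "dogs"],
    [(0, 2, "nsubj"), (2, 1, "det"), (0, 5, "nmod"), (5, 3, "case"), (5, 4, "det")],
)

dog_nodes = [i for i, n in enumerate(scene.nodes) if n["category"] == "dog"]
print(len(scene.nodes), len(scene.edges), dog_nodes)  # prints: 3 6 [0, 1]
```

In a full model, a neural network would then operate on the node and edge features of both graphs; that part is omitted here because the abstract does not specify it.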

Authors

Damien Teney, Lingqiao Liu, Anton van den Hengel

Topics

Machine Learning > Core Methods > Representation Learning

Deep Learning > Architectures > Graph Neural Networks

Natural Language Processing > Applications > Visual Question Answering

Deep Learning > Learning Types > Multi-Modal Learning

Computer Vision > Applications > Visual Question Answering

Artificial Intelligence > Core AI > Visual Question Answering

Keywords

scene understanding, visual question answering, multimodal learning, natural language understanding, multimodal reasoning, structured representation, graph neural network
