GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue

Harm de Vries; Florian Strub; Sarath Chandar; Olivier Pietquin; Hugo Larochelle; Aaron Courville

CVPR 2017

Abstract

We introduce GuessWhat?!, a two-player guessing game that serves as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Solving the task requires higher-level image understanding, such as spatial reasoning and language grounding. Our key contribution is a large-scale dataset of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks associated with the two players of the game. We prototype deep learning models to establish initial baselines for the introduced tasks.

Topics

Machine Learning > Learning Types > Self-Supervised Learning

Computer Vision > Analysis > Scene Understanding

Deep Learning > Learning Types > Multi-Modal Learning

Artificial Intelligence > Core AI > Dialogue Systems

Computer Vision > Applications > Question Answering

Keywords

visual question answering, language grounding, object discovery, spatial reasoning, dialogue system, multi-modal dialogue, guessing game
