Audio Visual Scene-Aware Dialog

Huda Alamri; Vincent Cartillier; Abhishek Das; Jue Wang; Anoop Cherian; Irfan Essa; Dhruv Batra; Tim K. Marks; Chiori Hori; Peter Anderson; Stefan Lee; Devi Parikh

CVPR 2019

Abstract

We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
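The inputs named in the abstract (video, audio, question, and dialog history) can be pictured as a simple record. The following is a minimal sketch only: every class, field, and function name is hypothetical and chosen for illustration, and it is not the released AVSD file format or the authors' code. The sample dialog text is invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DialogTurn:
    """One question/answer exchange about the video (hypothetical structure)."""
    question: str
    answer: str


@dataclass
class SceneAwareDialogExample:
    """One video with its dialog and final summary (field names are assumptions)."""
    video_path: str                      # visual stream of the scene
    audio_path: str                      # audio stream of the scene
    turns: List[DialogTurn] = field(default_factory=list)
    summary: str = ""                    # summary written by a dialog participant


def build_query(
    example: SceneAwareDialogExample, turn_index: int
) -> Tuple[List[Tuple[str, str]], str, str]:
    """Split a dialog into (history, current question, reference answer).

    The history holds every (question, answer) pair before turn_index, which is
    the context a model would see alongside the video and audio.
    """
    history = [(t.question, t.answer) for t in example.turns[:turn_index]]
    current = example.turns[turn_index]
    return history, current.question, current.answer


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    example = SceneAwareDialogExample(
        video_path="example_video.mp4",
        audio_path="example_audio.wav",
        turns=[
            DialogTurn("How many people are in the video?", "There is one person."),
            DialogTurn("What is the person doing?", "The person is folding a towel."),
        ],
        summary="A person folds a towel in a room.",
    )
    history, question, reference = build_query(example, turn_index=1)
    print(history)
    print(question)
    print(reference)
```

A model for this task would take the video, the audio, the history, and the current question as input and be scored against the reference answer; how each input is encoded is left open here.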

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Generation > Image Captioning; Computer Vision > Processing > Video Understanding; Natural Language Processing > Applications > Dialogue Systems; Artificial Intelligence > Core AI > Language; Deep Learning > Learning Types > Multi-Modal Learning; Computer Vision > Generation > Visual Question Answering

Keywords

visual question answering; multimodal learning; audio-visual learning; video understanding; visual dialog; language model; sequence-to-sequence model; dialogue system; dialog generation; multimodal dialog system
