A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses

Hisashi Kamezawa; Noriki Nishida; Nobuyuki Shimizu; Takashi Miyazaki; Hideki Nakayama

EMNLP 2020

Abstract

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is a challenging task, as is producing verbal responses. Our dataset is publicly available.
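
The four annotated components listed in the abstract could be held in a single record, as in the minimal Python sketch below. Everything here is an assumption made for illustration: the class name `VFDExample`, the field names, the helper `gaze_to_pixels`, and the choice to store the gaze location as normalized (x, y) coordinates are hypothetical and are not taken from the released dataset, whose actual file format may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VFDExample:
    """Hypothetical record layout; field names are illustrative, not the released schema."""

    image_path: str                # (1) first-person image seen by the agent
    utterance: str                 # (2) utterance of the human speaker
    gaze_xy: Tuple[float, float]   # (3) speaker's eye-gaze location, ASSUMED normalized to [0, 1]
    verbal_response: str           # (4) agent's verbal response
    nonverbal_response: str        # (4) agent's non-verbal response


def gaze_to_pixels(example: VFDExample, width: int, height: int) -> Tuple[int, int]:
    """Map the assumed normalized gaze location to pixel indices in a width x height image."""
    x, y = example.gaze_xy
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("gaze_xy is assumed to lie in [0, 1] x [0, 1]")
    # Scale to the valid index range [0, width - 1] and [0, height - 1].
    return round(x * (width - 1)), round(y * (height - 1))


# Placeholder values invented for this sketch; they are not drawn from the dataset.
example = VFDExample(
    image_path="placeholder.jpg",
    utterance="Could you pass me that?",
    gaze_xy=(0.5, 0.25),
    verbal_response="Sure, here you are.",
    nonverbal_response="hands over the object",
)
pixel_xy = gaze_to_pixels(example, width=641, height=481)
```

Keeping the verbal and non-verbal responses as separate fields mirrors the abstract's framing of them as two outputs that a model would need to produce for the same image, utterance, and gaze input.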

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Dialogue Systems; Deep Learning > Architectures > Neural Networks; Computer Vision > Core AI > Multimodal Learning; Computer Vision > Processing > Video Understanding; Computer Vision > Domain-Specific > Egocentric Vision; Natural Language Processing > Applications > Dialogue Systems

Keywords

multimodal learning, multimodal dialogue, dialogue system, visually-grounded dialogue, visual grounding, egocentric vision, first-person vision, gaze tracking, eye-gaze tracking, non-verbal response
