Situated and Interactive Multimodal Conversations

Seungwhan Moon; Satwik Kottur; Paul Crook; Ankita De; Shivani Poddar; Theodore Levin; David Whitney; Daniel Difranco; Ahmad Beirami; Eunjoon Cho; Rajen Subba; Alborz Geramifard

COLING 2020

Abstract

Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, and the user’s utterances) and to perform multimodal actions (e.g., displaying a route while generating the system’s utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup on two shopping domains: (a) furniture – grounded in a shared virtual environment; and (b) fashion – grounded in an evolving set of images. The datasets include the multimodal context of the items appearing in each scene, along with contextual NLU, NLG, and coreference annotations that use a novel, unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as structural API prediction, response generation, and dialog state tracking. We benchmark a collection of existing models on these SIMMC tasks as strong baselines and demonstrate rich multimodal conversational interactions. Our data, annotations, and models will be made publicly available.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Generation > Dialogue Systems

Keywords

coreference resolution, multimodal interaction, dialog state tracking, multimodal dialogue system
