Learning to Embed Multi-Modal Contexts for Situated Conversational Agents

Haeju Lee; Oh Joon Kwon; Yunseon Choi; Minho Park; Ran Han; Yoonhyung Kim; Jinhyeon Kim; Youngjune Lee; Haebin Shin; Kangwook Lee; Kee-eung Kim

NAACL 2022

Abstract

The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. Using a single unified model, this approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.

Topics

Natural Language Processing > Generation > Dialogue Systems

Keywords

coreference resolution, conversational agent, task-oriented dialog, dialog state tracking, multi-modal dialogue, multi-modal coreference
