NAACL 2021

MIMOQA: Multimodal Input Multimodal Output Question Answering

Abstract

Multimodal research has picked up significantly in question answering, with the task being extended to visual question answering, chart question answering, and multimodal input question answering. However, all of these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task, MIMOQA (Multimodal Input Multimodal Output Question Answering), in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates joint textual and visual attention to produce such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both automatic and human metrics.
