MIMOQA: Multimodal Input Multimodal Output Question Answering

Hrituraj Singh; Anshul Nasery; Denil Mehta; Aishwarya Agarwal; Jatin Lamba; Balaji Vasan Srinivasan

NAACL 2021

Abstract

Multimodal research on question answering has grown significantly, with the task extended to visual question answering, chart question answering, and multimodal input question answering. However, all of these settings produce a unimodal textual output as the answer. In this paper, we propose a novel task, MIMOQA (Multimodal Input Multimodal Output Question Answering), in which the output is also multimodal. Through human experiments, we show empirically that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates joint textual and visual attention to produce such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both automatic and human metrics.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Applications > Visual Question Answering
Computer Vision > Analysis > Visual Question Answering

Keywords

visual question answering, multimodal learning, joint attention, multimodal output
