GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution

Danfeng Guo; Arpit Gupta; Sanchit Agarwal; Jiun-Yu Kao; Shuyang Gao; Arijit Biswas; Chien-Wei Lin; Tagyoung Chung; Mohit Bansal

COLING 2022

Abstract

Learning from multimodal data has become a popular research topic in recent years. Multimodal coreference resolution (MCR) is an important task in this area. MCR involves resolving references across different modalities, e.g., text and images, which is a crucial capability for building next-generation conversational agents. MCR is challenging because it requires encoding information from different modalities and modeling the associations between them. Although significant progress has been made on visual-linguistic tasks such as visual grounding, most existing work involves single-turn utterances and focuses on simple coreference resolution. In this work, we propose an MCR model that resolves coreferences made in multi-turn dialogues with scene images. We present GRAVL-BERT, a unified MCR framework that combines visual relationships between objects, background scenes, dialogue, and metadata by integrating Graph Neural Networks with VL-BERT. We present results on the SIMMC 2.0 multimodal conversational dataset, ranking first in the DSTC-10 SIMMC 2.0 MCR challenge with an F1 score of 0.783. Our code is available at https://github.com/alexa/gravl-bert.
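
The abstract reports an F1 score but does not say how it is computed. As a rough illustration only, the sketch below assumes a micro-averaged, object-level F1: for each dialogue turn, the set of predicted object IDs is compared with the set of gold object IDs, and the counts are pooled across turns. The function name `object_f1` and the set-based input format are hypothetical and are not taken from the GRAVL-BERT code or the challenge's evaluation script.

```python
from typing import Iterable, Set, Tuple


def object_f1(
    predictions: Iterable[Set[int]],
    golds: Iterable[Set[int]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over referred object IDs.

    Assumption: each element is the set of object IDs referred to in one
    dialogue turn. This is an illustrative metric, not the official scorer.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        tp += len(pred & gold)  # objects correctly predicted
        fp += len(pred - gold)  # predicted but not actually referred to
        fn += len(gold - pred)  # referred to but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Two hypothetical turns: the first over-predicts, the second misses one object.
    p, r, f = object_f1([{1, 2}, {3}], [{1}, {3, 4}])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Pooling counts across turns (micro-averaging) weights every object mention equally. Averaging per-turn F1 scores instead would weight every turn equally and can give a different number.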

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Graph Neural Networks
Natural Language Processing > Understanding > Coreference Resolution
Natural Language Processing > Applications > Dialogue Systems
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Models > Transformers

Keywords

scene understanding, multimodal learning, visual-linguistic representation, visual grounding, coreference resolution, graph neural network, multimodal coreference resolution
