MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering

Zhengxuan Zhang; Yin Wu; Yuyu Luo; Nan Tang

2024 EMNLP EMNLP 2024

MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering

Abstract

AbstractA multimodal large language model MLLMs may struggle with answering visual-based (personal) entity questions (VEQA), such as ”who is A?” or ”who is A that B is talking to?” for various reasons, e.g., the absence of the name of A in the caption or the inability of MLLMs to recognize A, particularly for less common entities. Furthermore, even if the MLLMs can identify A, it may refrain from answering due to privacy concerns. In this paper, we introduce a novel method called Matching-Augmented Reasoning (MAR) to enhance VEQA. Given a collection of visual objects with captions, MAR preprocesses each object individually, identifying faces, names, and their alignments within the object. It encodes this information and stores their vector representations in vector databases. When handling VEQA, MAR retrieves matching faces and names and organizes these entities into a matching graph. MAR then derives the answer to the query by reasoning over this matching graph. Extensive experiments show that MAR significantly improves VEQA compared with the state-of-the-art methods using MLLMs.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Natural Language Processing

🧭 Keyword Pioneer — entity question

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zhengxuan Zhang , Yin Wu , Yuyu Luo , Nan Tang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Question Answering Artificial Intelligence > Core AI > Large Language Models Computer Vision > Core AI > Multimodal Learning

Keywords

visual question answering multimodal large language model graph reasoning multimodal reasoning knowledge retrieval entity recognition vector database entity question matching graph

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024