Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Shir Gur; Natalia Neverova; Chris Stauffer; Ser-Nam Lim; Douwe Kiela; Austin Reiter

EMNLP 2021

Abstract

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
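The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes images and captions have already been embedded in a shared space by some alignment model, which is not reproduced here. The names `CaptionIndex` and `augment_question`, the toy 2-dimensional embeddings, and the `[SEP]` concatenation are hypothetical choices for illustration, not the authors' implementation. The sketch retrieves the nearest caption for an image embedding by cosine similarity and shows how an index might be swapped at inference time without touching the rest of the pipeline.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class CaptionIndex:
    """Hypothetical brute-force index of (caption embedding, caption text) pairs."""

    def __init__(self, entries: List[Tuple[Vector, str]]) -> None:
        self.entries = list(entries)

    def top_k(self, query: Vector, k: int = 1) -> List[str]:
        # Rank all stored captions by similarity to the query embedding.
        ranked = sorted(
            self.entries, key=lambda e: cosine(query, e[0]), reverse=True
        )
        return [caption for _, caption in ranked[:k]]


def augment_question(
    question: str, image_embedding: Vector, index: CaptionIndex, k: int = 1
) -> str:
    """Append retrieved captions to the question text.

    Assumption: plain string concatenation with a [SEP] marker stands in for
    however a real model would consume the retrieved captions.
    """
    retrieved = index.top_k(image_embedding, k)
    return " [SEP] ".join([question] + retrieved)


# Two toy indices with made-up embeddings; swapping them changes the
# retrieved context without changing any other code.
index_a = CaptionIndex([
    ([1.0, 0.0], "a dog running on grass"),
    ([0.0, 1.0], "a plate of pasta"),
])
index_b = CaptionIndex([
    ([0.9, 0.1], "a puppy playing in a park"),
    ([0.1, 0.9], "a bowl of noodles"),
])

image_embedding = [0.8, 0.2]  # toy embedding of the query image
print(augment_question("What animal is this?", image_embedding, index_a))
print(augment_question("What animal is this?", image_embedding, index_b))
```

Because the index is passed in as an argument, replacing `index_a` with `index_b` is the only change needed to draw on a different caption collection, which is the sense of "hot-swapping" assumed in this sketch.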

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Applications > Question Answering
Machine Learning > Learning Types > Retrieval-Augmented Generation
Natural Language Processing > Applications > Visual Question Answering
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Retrieval-Augmented Generation
Computer Vision > Generation > Visual Question Answering

Keywords

visual question answering, multimodal learning, image captioning, cross-modal retrieval, retrieval-augmented generation, retrieval augmentation, multi-modal classification, alignment model, image-caption retrieval
