LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Siqi Sun; Yen-Chun Chen; Linjie Li; Shuohang Wang; Yuwei Fang; Jingjing Liu

2021 NAACL NAACL 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Abstract

AbstractMultimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to enormous computational cost mainly from cross-modal attention in Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study Image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by extracting pre-cached feature indexes offline, and employing instant dot-product matching online, which significantly speeds up retrieval process. In fact, our LightningDOT achieves superior performance across mainstream ITR benchmarks such as Flickr30k and COCO datasets, outperforming existing pre-trained models that consume 1000 times magnitude of computational hours using the same features.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐣 Hot Topic Early Bird — image-text retrieval

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Siqi Sun , Yen-Chun Chen , Linjie Li , Shuohang Wang , Yuwei Fang , Jingjing Liu

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Embedding Learning Machine Learning > Application Areas > Efficient Computing Deep Learning > Techniques > Pretraining

Keywords

efficient inference visual-semantic embedding image-text retrieval cross-modal attention multimodal pre-training

Download PDF

Related papers

Knowledge Router: Learning Disentangled Representations for Knowledge Graphs 2021

Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks 2021

Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction 2021

Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing 2021

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers 2021