Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

Haoliang Liu; Tan Yu; Ping Li

EMNLP 2021

Abstract

By exploiting cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model make it prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy against efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes into the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. The shrinking operation then gradually reduces the text-image interactions through knowledge distillation for higher efficiency. Through an inflating operation followed by a shrinking operation, both the efficiency and the accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.
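The idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function or the distillation loss, so everything concrete here is an assumption for illustration only: max-similarity aggregation as the late-interaction score, a KL divergence between teacher and student score distributions as the distillation objective, and the hypothetical helper names `late_interaction_score` and `distillation_loss`. An "inflated" model compares several codes per text and image; a "shrunk" model compares fewer, so it needs fewer pairwise interactions per text-image pair.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def late_interaction_score(text_codes, image_codes):
    """Assumed late-interaction score: for each text code, take the best
    cosine similarity over all image codes, then average over text codes.
    Cost is len(text_codes) * len(image_codes) dot products."""
    t = [normalize(c) for c in text_codes]
    v = [normalize(c) for c in image_codes]
    best = [max(sum(a * b for a, b in zip(tc, vc)) for vc in v) for tc in t]
    return sum(best) / len(best)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """Assumed distillation objective: KL(teacher || student) over the
    candidate images of one text query."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Toy data: one text query and five candidate images, four codes each.
rng = random.Random(0)
dim, num_codes = 8, 4
text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
images = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
          for _ in range(5)]

# Inflated (teacher): 4 x 4 = 16 code interactions per text-image pair.
teacher = [late_interaction_score(text, img) for img in images]
# Shrunk (student): 1 x 1 = 1 interaction per pair, using only the first code.
student = [late_interaction_score(text[:1], img[:1]) for img in images]

# Training the student would minimize this quantity; here it is only computed.
loss = distillation_loss(teacher, student)
```

In this sketch the student is obtained by simply dropping codes, whereas the abstract describes a gradual reduction; the sketch only shows how a cheaper scorer could be compared against a richer one.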

Topics

Machine Learning > Application Areas > Efficient Computing
Machine Learning > Application Areas > Knowledge Distillation
Computer Vision > Domain-Specific > Autonomous Driving
Natural Language Processing > Applications > Information Retrieval
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Techniques > Knowledge Distillation
Deep Learning > Learning Types > Multi-Modal Learning
Computer Vision > Generation > Image Retrieval

Keywords

model compression, knowledge distillation, cross-modal retrieval, efficient computing, late interaction, text-image retrieval, text-image matching
