ACL 2022
Scene-Text Aware Image and Text Retrieval with Dual-Encoder
Abstract
We tackle the tasks of image and text retrieval using a dual-encoder model in which images and text are encoded independently. This model has attracted attention as an approach that enables efficient offline inferences by connecting both vision and language in the same semantic space; however, whether an image encoder as part of a dual-encoder model can interpret scene-text (i.e., the textual information in images) is unclear. We propose pre-training methods that encourage a joint understanding of the scene-text and surrounding visual information. The experimental results demonstrate that our methods improve the retrieval performances of the dual-encoder models.
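To illustrate why independent encoding allows offline inference, the following is a minimal sketch of dual-encoder retrieval, not the authors' implementation. It assumes each encoder has already mapped its input to a vector in a shared space; the toy vectors and the helper names `l2_normalize`, `cosine_scores`, and `retrieve` are hypothetical and stand in for real encoder outputs. Image embeddings are computed once and stored, so a text query costs only one dot product per indexed image.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(query, index):
    """Score one query embedding against every pre-normalized embedding in the index."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, item)) for item in index]

def retrieve(query, index, k=1):
    """Return the positions of the k index entries most similar to the query."""
    scores = cosine_scores(query, index)
    order = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Offline step: image embeddings are computed once and stored normalized.
# These toy 3-d vectors are assumptions standing in for image-encoder outputs.
image_index = [
    l2_normalize(v)
    for v in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
]

# Online step: only the text query is encoded at search time.
# This vector is an assumption standing in for a text-encoder output.
text_query = [0.5, 0.9, 0.0]

top = retrieve(text_query, image_index, k=2)
print(top)
```

The same index can serve text-to-image retrieval as shown, or image-to-text retrieval by storing text embeddings and querying with an image embedding.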
Topics
Machine Learning > Core Methods > Embedding Learning
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Image Captioning
Computer Science > Applications > Information Retrieval
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Computer Vision > Generation > Image Retrieval
Deep Learning > Models > Vision-Language Models