Cross-Modal Coherence for Text-to-Image Retrieval

Malihe Alikhani; Fangda Han; Hareesh Ravi; Mubbasir Kapadia; Vladimir Pavlovic; Matthew Stone

AAAI 2022

Abstract

Common image-text joint understanding techniques presume that images and their associated text can universally be characterized by a single implicit model. However, co-occurring images and text can be related in qualitatively different ways, and explicitly modeling these relations could improve the performance of current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for the text-to-image retrieval task. Our analysis shows that models trained with image–text coherence relations retrieve the images originally paired with the target text more often than coherence-agnostic models. We also show via human evaluation that images retrieved by the proposed coherence-aware model are preferred over those retrieved by a coherence-agnostic baseline by a wide margin. Our findings provide insights into the ways that different modalities communicate and into the role of coherence relations in capturing commonsense inferences in text and imagery.
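The claim that a model retrieves "images originally paired with target text more often" is commonly quantified with a recall-at-k style measure. The following is a minimal sketch of such a measure, assuming text and image embeddings that live in a shared space and are compared by cosine similarity. The function names (`cosine`, `rank_images`, `recall_at_k`) and the toy vectors are hypothetical illustrations; they are not the paper's model, data, or reported metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_vecs, image_vecs, k):
    """Fraction of texts whose paired image (same index) is ranked in the top k."""
    hits = 0
    for idx, text_vec in enumerate(text_vecs):
        if idx in rank_images(text_vec, image_vecs)[:k]:
            hits += 1
    return hits / len(text_vecs)


# Toy embeddings, made up for illustration: text i is paired with image i.
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
images = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.7]]

print(recall_at_k(texts, images, 1))
print(recall_at_k(texts, images, 2))
```

Under this setup, a coherence-aware and a coherence-agnostic model could be compared by computing the same measure over each model's embeddings on a held-out set of image-text pairs.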

Topics

Machine Learning > Core Methods > Metric Learning
Computer Science > Applications > Information Retrieval
Data Science & Analytics > Applications > Information Retrieval
Machine Learning > Learning Types > Multi-Modal Learning

Keywords

multimodal learning, cross-modal learning, image-text alignment, text-to-image retrieval, image-text matching, coherence modeling, cross-modal coherence, joint understanding
