An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-level Structural Information

Zejun Li; Zhongyu Wei; Zhihao Fan; Haijun Shan; Xuanjing Huang

AAAI 2021

Abstract

In this paper, we focus on the problem of unsupervised image-sentence matching. Existing work uses document-level structural information to sample positive and negative instances for model training. Although this approach achieves positive results, it introduces a sampling bias and fails to distinguish instances with high semantic similarity. To alleviate the bias, we propose a new sampling strategy that selects additional intra-document image-sentence pairs as positive or negative samples. Furthermore, to recognize the complex patterns in intra-document samples, we propose a Transformer-based model that captures fine-grained features and implicitly constructs a graph for each document, in which concepts from the document bridge the representation learning of images and sentences in the document's context. Experimental results show that our approach alleviates the bias and learns well-aligned multimodal representations.
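
To make the sampling idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that cross-document image-sentence pairs serve as negatives, and that intra-document pairs are split into positives and negatives by a hypothetical similarity function `score` and a hypothetical cutoff `pos_threshold`. The abstract does not specify the actual selection rule, so both are placeholders, as is the `docs` data layout.

```python
from itertools import product


def sample_pairs(docs, score, pos_threshold=0.5):
    """Split image-sentence pairs into three pools.

    docs: list of dicts with "images" and "sentences" lists (assumed layout).
    score: hypothetical callable (image, sentence) -> float similarity.
    pos_threshold: hypothetical cutoff for labeling intra-document pairs.
    """
    intra_pos, intra_neg, cross_neg = [], [], []
    for i, doc in enumerate(docs):
        # Intra-document pairs: labeled by the placeholder similarity score.
        for img, sent in product(doc["images"], doc["sentences"]):
            if score(img, sent) >= pos_threshold:
                intra_pos.append((img, sent))
            else:
                intra_neg.append((img, sent))
        # Cross-document pairs: assumed here to always be negatives.
        for j, other in enumerate(docs):
            if i != j:
                cross_neg.extend(product(doc["images"], other["sentences"]))
    return intra_pos, intra_neg, cross_neg


# Toy usage with string stand-ins for images and sentences.
docs = [
    {"images": ["a1"], "sentences": ["a1 text", "x"]},
    {"images": ["b1"], "sentences": ["b1 text"]},
]
toy_score = lambda img, sent: 1.0 if img in sent else 0.0
intra_pos, intra_neg, cross_neg = sample_pairs(docs, toy_score)
```

In this sketch the intra-document negatives are the pairs that a purely document-level sampler would never see as negatives, which is the kind of bias the abstract describes.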

Authors

Zejun Li, Zhongyu Wei, Zhihao Fan, Haijun Shan, Xuanjing Huang

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Types > Unsupervised Learning
Deep Learning > Architectures > Transformers
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Computer Vision > Processing > Image Retrieval

Keywords

unsupervised learning, multimodal representation learning, multimodal representation, transformer model, graph neural network, image-sentence matching, document-level structure
