Multimodal Convolutional Neural Networks for Matching Image and Sentence

Lin Ma; Zhengdong Lu; Lifeng Shang; Hang Li

2015 ICCV ICCV 2015

Multimodal Convolutional Neural Networks for Matching Image and Sentence

Abstract

In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN modeling the joint representation of image and sentence. The matching CNN composes different semantic fragments from words and learns the inter-modal relations between image and the composed fragments at different levels, thus fully exploit the matching relations between image and sentence. Experimental results demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. More specifically, our proposed m-CNNs significantly outperform the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K datasets.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🧭 Keyword Pioneer — image sentence matching

🐣 Hot Topic Early Bird — multimodal learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Lin Ma , Zhengdong Lu , Lifeng Shang , Hang Li

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Deep Learning > Architectures > Neural Networks

Keywords

multimodal learning convolutional neural network joint representation semantic embedding image sentence matching

Download PDF

Related papers

Cutting Edge: Soft Correspondences in Multimodal Scene Parsing 2015

Unsupervised Generation of a Viewpoint Annotated Car Dataset From Videos 2015

Depth-Based Hand Pose Estimation: Data, Methods, and Challenges 2015

Peeking Template Matching for Depth Extension 2015

Learning Semi-Supervised Representation Towards a Unified Optimization Framework for Semi-Supervised Learning 2015