Finding Beans in Burgers: Deep Semantic-Visual Embedding With Localization

Martin Engilberge; Louis Chevallier; Patrick Pérez; Matthieu Cord

CVPR 2018

Abstract

Several works have proposed learning a two-path neural network that maps images and texts, respectively, to a single shared Euclidean space where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space in any input image, delivering state-of-the-art results on the visual grounding of phrases.
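To make the idea of cross-modal retrieval in a shared space concrete, here is a minimal sketch rather than the authors' implementation. It assumes that both paths have already produced vectors of the same dimension and that candidates are ranked by cosine similarity; the abstract specifies neither the scoring function nor the dimensionality. The file names, vectors, and helper functions are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, candidates):
    """Return candidate keys ordered from most to least similar to the query."""
    scores = {key: cosine_similarity(query, vec) for key, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy vectors standing in for the outputs of a trained visual path.
image_embeddings = {
    "burger.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "salad.jpg": [0.5, 0.8, 0.1],
}

# Hypothetical output of the textual path for a caption such as "a burger with beans".
caption_embedding = [0.8, 0.2, 0.1]

# Text-to-image retrieval: rank all images against the caption.
ranking = rank_by_similarity(caption_embedding, image_embeddings)
print(ranking)
```

Image-to-text retrieval would work the same way with the roles swapped: the query is an image vector and the candidates are caption vectors.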

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Generation > Image Captioning
Deep Learning > Learning Types > Representation Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

image captioning, visual grounding, cross-modal retrieval, semantic-visual embedding

Related papers

Multi-Shot Pedestrian Re-Identification via Sequential Decision Making (2018)

Multi-Cue Correlation Filters for Robust Visual Tracking (2018)

Pointwise Convolutional Neural Networks (2018)

Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking (2018)

Image Generation From Scene Graphs (2018)