Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy; Li Fei-fei

2015 CVPR CVPR 2015

Deep Visual-Semantic Alignments for Generating Image Descriptions

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

📈 Trend Setter — Transformers

🧭 Keyword Pioneer — visual-semantic alignment

🐣 Hot Topic Early Bird — multimodal learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Andrej Karpathy , Li Fei-fei

Topics

Deep Learning > Architectures > Transformers Deep Learning > Models > Generative Models Computer Vision > Generation > Image Captioning Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Architectures > Recurrent Neural Networks

Keywords

multimodal learning image captioning convolutional neural network recurrent neural network visual-semantic alignment multimodal embedding bidirectional recurrent neural network

Download PDF

Related papers

Long-Term Correlation Tracking 2015

Hierarchically-Constrained Optical Flow 2015

Propagated Image Filtering 2015

Web Scale Photo Hash Clustering on A Single Machine 2015

Expanding Object Detector's Horizon: Incremental Learning Framework for Object Detection in Videos 2015