Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Liwei Wang; Alexander Schwing; Svetlana Lazebnik

2017 NIPS NeurIPS 2017

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Abstract

This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a “vanilla” CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.

🌉 Interdisciplinary Bridge — Computer Science and Computer Vision and Deep Learning

🧭 Keyword Pioneer — additive gaussian prior

🐣 Hot Topic Early Bird — latent space

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Liwei Wang , Alexander Schwing , Svetlana Lazebnik

Topics

Deep Learning > Models > Generative Models Deep Learning > Models > Variational Inference Computer Vision > Generation > Image Captioning Computer Science > Systems > Computer Graphics

Keywords

image captioning generative model gaussian mixture model latent space variational auto-encoder additive gaussian prior

Download PDF

Related papers

High-Order Attention Models for Visual Question Answering 2017

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization 2017

Premise Selection for Theorem Proving by Deep Graph Embedding 2017

Neural Program Meta-Induction 2017

Safe and Nested Subgame Solving for Imperfect-Information Games 2017