Comprehension-Guided Referring Expressions

Ruotian Luo; Gregory Shakhnarovich

CVPR 2017

Abstract

We consider generation and comprehension of natural language referring expressions for objects in an image. Unlike generic "image captioning", which lacks natural standard evaluation criteria, the quality of a referring expression may be measured by the receiver's ability to correctly infer which object is being described. Following this intuition, we propose two approaches that use models trained for the comprehension task to generate better expressions. First, we use a comprehension module trained on human-generated expressions as a "critic" of the referring expression generator. The comprehension module serves as a differentiable proxy for human evaluation, providing a training signal to the generation module. Second, we use the comprehension model in a generate-and-rerank pipeline, which chooses among candidate expressions generated by a model according to their performance on the comprehension task. We show that both approaches lead to improved referring expression generation on multiple benchmark datasets.
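The generate-and-rerank idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the comprehension model scores an expression, so the word-overlap scorer `toy_comprehension_score`, the representation of regions as sets of words, and the function name `rerank` are all hypothetical stand-ins chosen only to make the example runnable.

```python
import math
from typing import Callable, Sequence, Set


def toy_comprehension_score(expression: str, region: Set[str]) -> float:
    """Hypothetical stand-in for a learned comprehension model:
    counts how many words of the expression match the region's attributes."""
    return float(sum(1 for word in expression.lower().split() if word in region))


def rerank(
    candidates: Sequence[str],
    regions: Sequence[Set[str]],
    target_index: int,
    score: Callable[[str, Set[str]], float] = toy_comprehension_score,
) -> str:
    """Return the candidate for which the comprehension scorer assigns the
    highest probability to the target region among all regions."""

    def target_probability(expression: str) -> float:
        scores = [score(expression, region) for region in regions]
        # Softmax over regions, shifted by the maximum for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        return exps[target_index] / sum(exps)

    return max(candidates, key=target_probability)


# Assumed toy scene: two regions described by attribute words.
regions = [{"red", "cup", "left"}, {"blue", "cup", "right"}]
# Assumed candidates, as if sampled from a generation model.
candidates = ["the cup", "the red cup", "cup on the right"]

best_for_first = rerank(candidates, regions, target_index=0)
best_for_second = rerank(candidates, regions, target_index=1)
```

In this toy scene the ambiguous candidate "the cup" scores equally on both regions, so the reranker prefers whichever candidate singles out the target. A learned comprehension model would replace the overlap scorer.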

Topics

Computer Vision > Generation > Image Captioning
Computer Vision > Core AI > Computer Vision
Deep Learning > Learning Types > Multi-Modal Learning
Natural Language Processing > Applications > Natural Language Generation

Keywords

natural language generation, multimodal learning, image captioning, referring expression, visual grounding, language generation, referring expression generation
