FOIL it! Find One mismatch between Image and Language caption

Ravi Shekhar; Sandro Pezzelle; Yauhen Klimovich; Aurélie Herbelot; Moin Nabi; Enver Sangineto; Raffaella Bernardi

ACL 2017

Abstract

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
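The foil-caption idea and the three tasks named in the abstract can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `CaptionExample` record, the `make_foil` and `accuracy` helpers, the `img-001` identifier, and the dog/cat caption are hypothetical and do not come from FOIL-COCO. The sketch swaps a hand-picked word and does not reproduce how the authors choose foil words.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CaptionExample:
    image_id: str                # hypothetical identifier, not a real MS-COCO id
    caption: str
    foil_word: Optional[str]     # None when the caption is correct
    target_word: Optional[str]   # original word that the foil word replaced


def make_foil(image_id: str, caption: str,
              target_word: str, foil_word: str) -> CaptionExample:
    """Build a foil caption by swapping exactly one word of a correct caption."""
    tokens = caption.split()
    if tokens.count(target_word) != 1:
        raise ValueError("target word must occur exactly once in the caption")
    foiled = [foil_word if tok == target_word else tok for tok in tokens]
    return CaptionExample(image_id, " ".join(foiled), foil_word, target_word)


def accuracy(predictions: Sequence, gold: Sequence) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy data (assumed): one correct caption and its foil for the same image.
correct = CaptionExample("img-001", "a dog runs on the beach", None, None)
foil = make_foil("img-001", correct.caption, target_word="dog", foil_word="cat")
examples = [correct, foil]
foils_only = [ex for ex in examples if ex.foil_word is not None]

# Gold labels for the three tasks listed in the abstract.
gold_is_foil = [ex.foil_word is not None for ex in examples]   # a) classification
gold_foil_words = [ex.foil_word for ex in foils_only]          # b) foil word detection
gold_corrections = [ex.target_word for ex in foils_only]       # c) foil word correction
```

A model's outputs for each task would be scored against the matching gold list, for example `accuracy(predicted_is_foil, gold_is_foil)` for caption classification, where `predicted_is_foil` stands in for whatever a given model returns.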

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Generation > Image Captioning
Artificial Intelligence > Core AI > Computer Vision
Computer Vision > Core AI > Computer Vision
Deep Learning > Learning Types > Multi-Modal Learning
Natural Language Processing > Applications > Natural Language Understanding

Keywords

multimodal learning, image captioning, vision-language alignment, image-text mismatch, caption verification, multi-modal learning, vision-language model, semantic mismatch, foil detection
