ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense

Kankan Zhou; Eason Lai; Wei Bin Au Yeong; Kyriakos Mouratidis; Jing Jiang

EMNLP 2023

Abstract

Humans possess a strong capability for reasoning beyond common sense. For example, given an unconventional image of a goldfish lying on a table next to an empty fishbowl, a human would effortlessly determine that the fish is not inside the fishbowl. The case, however, may be different for a vision-language model, whose reasoning could gravitate towards the common scenario that the fish is inside the bowl, despite the visual input. In this paper, we introduce a novel probing dataset named ROME (reasoning beyond commonsense knowledge) to evaluate whether state-of-the-art pre-trained vision-language models have the reasoning capability to correctly interpret counter-intuitive content. ROME contains images that defy commonsense knowledge with regard to color, shape, material, size, and positional relation. Experiments on state-of-the-art pre-trained vision-language models reveal that most of these models are still largely incapable of interpreting counter-intuitive scenarios. We hope that ROME will spur further investigation of reasoning beyond commonsense knowledge in vision-language research.
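As a rough illustration of how a probe of this kind might be scored, the sketch below computes per-category accuracy over yes/no answers. It is not the authors' evaluation code: the record layout (`category`, `gold`, `pred`), the function name `accuracy_by_category`, and the toy records are all assumptions made up for the example. Only the five category names come from the abstract.

```python
from collections import defaultdict

# The five knowledge types named in the abstract.
CATEGORIES = ("color", "shape", "material", "size", "positional relation")


def accuracy_by_category(records):
    """Return {category: accuracy} for an iterable of probe records.

    Each record is a dict with hypothetical keys:
      'category' - one of CATEGORIES
      'gold'     - the answer supported by the image ("yes" or "no")
      'pred'     - the model's answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if record["pred"].strip().lower() == record["gold"].strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up records for illustration only; these are not ROME data or results.
# The first record mimics a model answering "yes" to "Is the fish inside the
# fishbowl?" even though the image shows the fish on the table.
toy_records = [
    {"category": "positional relation", "gold": "no", "pred": "yes"},
    {"category": "positional relation", "gold": "no", "pred": "no"},
    {"category": "color", "gold": "yes", "pred": "Yes"},
]

scores = accuracy_by_category(toy_records)
```

Reporting accuracy separately for each category, rather than as one pooled number, would show whether a model's commonsense bias is concentrated in one knowledge type, for example positional relations.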

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Scene Understanding
Artificial Intelligence > Core AI > Reasoning
Deep Learning > Models > Vision-Language Models
Computer Vision > Applications > Visual Question Answering

Keywords

model evaluation; visual reasoning; vision-language model; counterfactual reasoning; image understanding; multimodal reasoning; commonsense reasoning; visual common sense; counter-intuitive reasoning; probing dataset
