VIPHY: Probing “Visible” Physical Commonsense Knowledge

Shikhar Singh; Ehsan Qasemi; Muhao Chen

EMNLP 2023

Abstract

Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks (e.g. attributes, location). While such tasks measure the requisite knowledge to ground and reason over a given visual instance, they do not, however, measure the ability of VLMs to retain and generalize such knowledge. In this work, we evaluate VLMs’ ability to acquire “visible” physical knowledge – the information that is easily accessible from images of static scenes, particularly along the dimensions of object color, size, and space. We build an automatic pipeline to derive a comprehensive knowledge resource for calibrating and probing these models. Our results indicate a severe gap between model and human performance across all three dimensions. Furthermore, we demonstrate that a caption pretrained LM significantly outperforms VLMs on both size and spatial tasks – highlighting that despite sufficient access to ground language with visual modality, they struggle to retain such knowledge.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Scene Understanding
Interdisciplinary > Linguistics > Computational Linguistics
Machine Learning > Learning Types > Evaluation
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models
Deep Learning > Learning Types > Evaluation

Keywords

commonsense knowledge, knowledge probing, visual reasoning, vision-language model, probing, evaluation, physical commonsense, static scene
