Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Michal Golovanevsky; William Rudman; Michael A. Lepori; Amir Bar; Ritambhara Singh; Carsten Eickhoff

2025 EMNLP EMNLP 2025

Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Abstract

AbstractMultimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g, red strawberry) into direct conflict with visual input (e.g, blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 99.3% of color and 80.8% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — activation intervention

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Michal Golovanevsky , William Rudman , Michael A. Lepori , Amir Bar , Ritambhara Singh , Carsten Eickhoff

Topics

Artificial Intelligence > Core AI > Interpretability Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Large Language Models Computer Vision > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Knowledge Representation Deep Learning > Learning Types > Representation Learning Deep Learning > Models > Vision-Language Models

Keywords

vision-language model multimodal reasoning knowledge prior activation steering model prediction reasoning analysis visual counterfactual activation intervention representational dynamics

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025