Understanding the Visual Projection Space of Multimodal LLMs

SungHeon Jeong; Yoojeong Song; Hyungjoon Kim

WACV 2026

Abstract

What role does a single vision token play inside a multimodal large language model (MLLM)? Despite recent successes, most MLLMs adopt a simple design: a projected visual feature z = P(f_x) is prepended to the text sequence. Yet it remains unclear whether this vector merely provides context or actively steers generation. We propose a geometric probing framework that analyzes latent-token alignment, intrinsic dimensionality, and perturbation sensitivity. Across four datasets and three representative MLLMs (LLaVA, BLIP-2, Kosmos-2), we find clear operating regimes: BLIP-2 enforces rigid low-rank compression with strong alignment but near-zero sensitivity, LLaVA exhibits flexible high-dimensional mappings with high responsiveness, and Kosmos-2 falls between the two. These signatures correlate with downstream behavior (SQA correctness and VQAv2 hallucination severity), showing that both reduced and excessive sensitivity predict unreliable grounding. Our results highlight geometry as a diagnostic lens for vision-language coupling and offer actionable guidance for projection design, alignment objectives, and user-steerable multimodal generation.
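To make the quantities named in the abstract concrete, here is a minimal NumPy sketch on synthetic data. It is not the paper's implementation. The linear projector `W`, the helper names (`project`, `prepend_visual_token`, `participation_ratio`, `sensitivity`), and the toy dimensions are all assumptions made for illustration. The participation ratio is one common proxy for intrinsic dimensionality and the finite-difference probe is one simple way to measure perturbation sensitivity; the abstract does not say which estimators the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(f_x, W):
    """Hypothetical linear projector P: vision feature -> LLM embedding space."""
    return f_x @ W

def prepend_visual_token(z, text_emb):
    """Put the projected visual vector z in front of the text token embeddings."""
    return np.vstack([z[None, :], text_emb])

def participation_ratio(Z):
    """Intrinsic-dimensionality proxy (assumed estimator):
    (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    eig = np.linalg.eigvalsh(Zc.T @ Zc / (len(Z) - 1))
    eig = np.clip(eig, 0.0, None)  # guard against tiny negative round-off
    return eig.sum() ** 2 / (eig ** 2).sum()

def sensitivity(model_fn, z, eps=1e-2, n_dirs=32, rng=rng):
    """Mean output change per unit perturbation of z along random unit directions."""
    base = model_fn(z)
    ratios = []
    for _ in range(n_dirs):
        d = rng.normal(size=z.shape)
        d /= np.linalg.norm(d)
        ratios.append(np.linalg.norm(model_fn(z + eps * d) - base) / eps)
    return float(np.mean(ratios))

# Toy setup: 200 synthetic vision features, two hypothetical projectors.
d_vis, d_llm = 16, 8
F = rng.normal(size=(200, d_vis))
W_full = rng.normal(size=(d_vis, d_llm))                          # full-rank map
W_low = rng.normal(size=(d_vis, 2)) @ rng.normal(size=(2, d_llm))  # rank-2 map

pr_full = participation_ratio(project(F, W_full))
pr_low = participation_ratio(project(F, W_low))

# Sequence fed to the language model: 1 visual token + 5 (random) text tokens.
seq = prepend_visual_token(project(F[0], W_full), rng.normal(size=(5, d_llm)))

# Stand-in for "downstream output as a function of z" (purely illustrative).
A = rng.normal(size=(d_llm, d_llm))
s = sensitivity(lambda v: np.tanh(v @ A), project(F[0], W_full))
print(pr_full, pr_low, seq.shape, s)
```

In this toy setting the rank-2 projector cannot yield a participation ratio above 2, which mirrors the kind of low-rank compression the abstract attributes to BLIP-2. A real probe would replace the `tanh` stand-in with the actual model's logits or hidden states.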
