ICCV 2025

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Abstract

Recent advancements in Vision-Language Models (VLMs) have fueled interest in autonomous driving applications, particularly for interpretable decision-making. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs possess inherent corruption awareness but explicitly acknowledge corrupted inputs only when directly prompted. Motivated by these challenges and by this inherent corruption awareness, we propose Robust Agentic Utilization (RAU), which leverages VLMs' corruption awareness and agentic planning with external tools to enhance perception reliability for downstream tasks. Our study challenges existing evaluation paradigms and provides a roadmap toward more robust and interpretable autonomous driving systems.
