DRIVINGVQA: A Dataset for Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios

Charles Corbière; Simon Roburin; Syrielle Montariol; Antoine Bosselut; Alexandre Alahi

2026 EACL EACL 2026

DRIVINGVQA: A Dataset for Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios

Abstract

AbstractWhile chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we explore the benefits of incorporating entity-related information, such as entity names, spatial coordinates, and visual content, through supervised fine-tuning to enhance the model’s reasoning abilities. Our experiments demonstrate that interleaving textual explanations with visual tokens extracted from entities relevant to the question improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that this retrieval-based approach effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Charles Corbière , Simon Roburin , Syrielle Montariol , Antoine Bosselut , Alexandre Alahi

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Applications > Question Answering

Keywords

retrieval augmented generation visual question answering vision language model entity grounding chain of thought

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026