Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

David Acuna; Ximing Lu; Jaehun Jung; Hyunwoo Kim; Amlan Kar; Sanja Fidler; Yejin Choi

2025 EMNLP EMNLP 2025

Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

Abstract

AbstractRecent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning—akin to the success observed in language models—via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces— without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion–subanswer pairs into the model’s output stream. We show that framing reasoning as a search process—where subquestions act as latent decisions within a broader inference trajectory—helps the model “connect the dots” between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🧭 Keyword Pioneer — subquestion answering

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

David Acuna , Ximing Lu , Jaehun Jung , Hyunwoo Kim , Amlan Kar , Sanja Fidler , Yejin Choi

Topics

Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Core AI > Planning Artificial Intelligence > Core AI > Reasoning Artificial Intelligence > Core AI > Multi-Modal Learning Deep Learning > Models > Vision-Language Models

Keywords

chain-of-thought reasoning visual reasoning monte carlo tree search vision-language model test-time scaling subquestion answering test-time inference

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025