LATTE: Learning to Think with Vision Specialists

Zixian Ma; Jianguo Zhang; Zhiwei Liu; Jieyu Zhang; Juntao Tan; Manli Shu; Juan Carlos Niebles; Shelby Heinecke; Huan Wang; Caiming Xiong; Ranjay Krishna; Silvio Savarese

EMNLP 2025

Abstract

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant 4-5% gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.

Topics

Artificial Intelligence > Core AI > Interpretability; Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Reasoning; Deep Learning > Models > Large Language Models; Computer Vision > Core AI > Multimodal Learning; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

knowledge distillation; question answering; multimodal learning; vision-language model; reasoning capability; reasoning trace; multi-modal reasoning; perceptual output; perceptual reasoning; perceptual information; vision specialist
