CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao; Yao Lu; Moo Jin Kim; Zipeng Fu; Zhuoyang Zhang; Yecheng Wu; Zhaoshuo Li; Qianli Ma; Song Han; Chelsea Finn; Ankur Handa; Tsung-Yi Lin; Gordon Wetzstein; Ming-Yu Liu; Donglai Xiang

CVPR 2025

Abstract

Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Videos are available at: https://cot-vla.github.io/.
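
The two-stage inference order described in the abstract (first a future image as a visual goal, then a short action sequence conditioned on it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class VisualCoTPolicy, the next_token callable, the token counts, and the toy model below are all hypothetical stand-ins, and images, text, and actions are assumed to be already converted to integer tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Tokens = List[int]


@dataclass
class VisualCoTPolicy:
    """Hypothetical wrapper around any autoregressive token model."""

    next_token: Callable[[Sequence[int]], int]  # assumed model interface
    num_image_tokens: int   # assumed token length of one predicted goal image
    num_action_tokens: int  # assumed token length of the short action sequence

    def _decode(self, context: Tokens, n: int) -> Tokens:
        """Generate n tokens one at a time, each conditioned on all prior tokens."""
        out: Tokens = []
        for _ in range(n):
            out.append(self.next_token(context + out))
        return out

    def step(self, observation: Tokens, instruction: Tokens) -> Tuple[Tokens, Tokens]:
        context = instruction + observation
        # Stage 1: predict a future image frame as the visual goal.
        subgoal = self._decode(context, self.num_image_tokens)
        # Stage 2: predict actions conditioned on the context AND the visual goal.
        actions = self._decode(context + subgoal, self.num_action_tokens)
        return subgoal, actions


def toy_next_token(context: Sequence[int]) -> int:
    """Deterministic placeholder model, used only to make the sketch runnable."""
    return (sum(context) + len(context)) % 7


policy = VisualCoTPolicy(toy_next_token, num_image_tokens=4, num_action_tokens=3)
subgoal, actions = policy.step(observation=[1, 2, 3], instruction=[5, 6])
```

The point of the sketch is the ordering: the action tokens are decoded only after the goal tokens exist, so the goal acts as an intermediate reasoning step rather than a side output. A direct input-output policy would correspond to skipping stage 1 and decoding actions from the context alone.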

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Planning; Artificial Intelligence > Core AI > Reasoning; Robotics > Applications > Robotics

Keywords

robotic manipulation; chain-of-thought reasoning; visual planning; vision-language-action model; visual goal prediction; sensorimotor control
