Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man; De-An Huang; Guilin Liu; Shiwei Sheng; Shilong Liu; Liang-Yan Gui; Jan Kautz; Yu-Xiong Wang; Zhiding Yu

CVPR 2025

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective.
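The abstract's idea of using a grounded object region as an intermediate reasoning step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `crop_roi` and `grounded_cot`, the `ground` and `answer` callables, the normalized `(x0, y0, x1, y1)` box format, and the toy list-of-lists image are all assumptions made for illustration. The sketch only shows the control flow implied by the abstract: first ground the question to a region, then engage that region explicitly when answering.

```python
import math
from typing import Callable, List, Tuple

# Assumed conventions (hypothetical, not from the paper):
Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1) in [0, 1]
Image = List[List[int]]                  # toy grayscale image, rows of pixels


def crop_roi(image: Image, box: Box) -> Image:
    """Crop the region of interest described by a normalized box."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Convert to pixel indices and clamp so the crop is never empty.
    left = max(0, min(width - 1, int(x0 * width)))
    top = max(0, min(height - 1, int(y0 * height)))
    right = max(left + 1, min(width, math.ceil(x1 * width)))
    bottom = max(top + 1, min(height, math.ceil(y1 * height)))
    return [row[left:right] for row in image[top:bottom]]


def grounded_cot(
    image: Image,
    question: str,
    ground: Callable[[Image, str], Box],
    answer: Callable[[Image, Image, str], object],
) -> Tuple[Box, object]:
    """Two-step reasoning: ground the question, then answer using the region."""
    # Step 1: the grounding step acts as the intermediate "thought".
    box = ground(image, question)
    # Step 2: the cropped region is passed back in alongside the full image.
    roi = crop_roi(image, box)
    return box, answer(image, roi, question)


# Toy usage with stand-in callables in place of a real model.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_box, toy_answer = grounded_cot(
    toy_image,
    "What is the brightest pixel in the lower-right object?",
    ground=lambda img, q: (0.5, 0.5, 1.0, 1.0),
    answer=lambda img, roi, q: max(max(row) for row in roi),
)
```

In a real system both callables would be backed by the same multimodal model; here they are stand-ins so the flow can be run end to end.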

Topics

Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Reasoning
Computer Vision > Core AI > Multimodal Learning

Keywords

object detection; visual reasoning; visual grounding; multimodal large language model; chain of thought; grounded chain of thought; vision-centric reasoning; object-centric attention; region of interest engagement
