VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang; Delin Chen; Xueyang Yu; Maohao Shen; Chuang Gan

ICCV 2025

Abstract

Long video understanding poses unique challenges due to its temporal complexity and low information density. Recent works address this task either by sampling numerous frames or by incorporating auxiliary tools through LLMs, and both approaches incur high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed "VCA". Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach's superior effectiveness and efficiency.
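
The abstract describes a tree search over video segments guided by a self-generated reward, but gives no algorithmic detail. The following is only a minimal illustrative sketch of that general idea, not the authors' implementation: it runs a best-first search that repeatedly expands the highest-scoring segment and collects its middle frame. The names `explore`, `score`, `budget`, `branch`, and `toy_score` are all hypothetical, and `score` is a stand-in for whatever relevance estimate a VLM would produce for a segment.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open range of frame indices: [start, end)


def explore(num_frames: int,
            score: Callable[[Segment], float],
            budget: int = 8,
            branch: int = 2) -> List[int]:
    """Best-first search over video segments.

    `score` is a hypothetical stand-in for a VLM's self-assessed reward.
    Returns the sorted indices of the frames collected within `budget`.
    """
    root = (0, num_frames)
    # heapq is a min-heap, so scores are negated to pop the best segment first.
    frontier = [(-score(root), root)]
    collected = set()
    while frontier and len(collected) < budget:
        _, (start, end) = heapq.heappop(frontier)
        collected.add((start + end) // 2)  # observe the segment's middle frame
        length = end - start
        if length < branch:
            continue  # segment too short to split further
        step = length // branch
        for i in range(branch):
            s = start + i * step
            e = end if i == branch - 1 else s + step
            heapq.heappush(frontier, (-score((s, e)), (s, e)))
    return sorted(collected)


def toy_score(seg: Segment, target: int = 700) -> float:
    """Toy reward (assumption): segments centred near `target` score higher."""
    start, end = seg
    return -abs((start + end) // 2 - target)


if __name__ == "__main__":
    print(explore(1000, toy_score, budget=6))
```

In this toy setting the collected frames concentrate around the hypothetical target frame instead of being spread uniformly, which is the efficiency argument the abstract makes against dense frame sampling.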

Authors

Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan

Topics

Artificial Intelligence > Core AI > Agent Systems
Machine Learning > Learning Types > Active Learning
Computer Vision > Processing > Video Understanding
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Reinforcement Learning

Keywords

video understanding, tree search, vision-language model, intrinsic reward, curiosity-driven exploration, long video

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025