Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Haibo Wang; Zhiyang Xu; Yu Cheng; SHIZHE DIAO; Yufan Zhou; Yixin Cao; Qifan Wang; Weifeng Ge; Lifu Huang

2025 EMNLP EMNLP 2025

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Abstract

AbstractDespite their impressive performance in coarse-grained video understanding, Video Large Language Models (Video-LLMs) still face challenges in fine-grained temporal grounding, including ineffective temporal modeling and inadequate timestamp representations. In this work, we introduce Grounded-VideoLLM, a novel Video-LLM designed to perceive and reason over specific video moments with fine-grained temporal precision. Our model features (1) a two-stream encoder that explicitly captures inter-frame relationships while preserving intra-frame visual details and (2) discrete temporal tokens enriched with structured time knowledge for timestamp representation. Besides, we propose a multi-stage training strategy tailored to such grounding-specific architecture. The model is initially trained on simple video-caption tasks and progressively introduced to complex video temporal grounding tasks, ensuring a smooth learning curve and temporal alignment. We further strengthen Grounded-VideoLLM’s temporal reasoning by constructing a VideoQA dataset with grounded information using an automated annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only surpasses existing models in fine-grained grounding tasks but also exhibits strong potential as a general video understanding assistant.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — timestamp representation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Haibo Wang , Zhiyang Xu , Yu Cheng , SHIZHE DIAO , Yufan Zhou , Yixin Cao , Qifan Wang , Weifeng Ge , Lifu Huang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Computer Vision > Core AI > Multimodal Learning Deep Learning > Learning Types > Multi-Modal Learning

Keywords

multimodal learning video understanding temporal grounding fine-grained understanding fine-grained reasoning video large language model large language model timestamp representation

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025