Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning

Ruizhe Chen; Tianze Luo; Zhiting Fan; Heqing Zou; Zhaopeng Feng; Guiyang Xie; Hansheng Zhang; Zhuochen Wang; Zuozhu Liu; Zhang Huaijian

EMNLP 2025

Abstract

Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning (SFT) with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold-start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold-start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
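To make the terms in the abstract concrete, the following is a minimal sketch of two ingredients a VTG-with-RL pipeline of this kind might use: a temporal IoU score between a predicted and a ground-truth segment, and a difficulty filter that keeps only training samples of intermediate difficulty. The abstract does not specify the reward or the filtering rule, so everything here is an assumption for illustration: the names `temporal_iou` and `keep_for_rl`, the use of mean IoU over sampled predictions as a difficulty proxy, and the thresholds `low` and `high` are all hypothetical.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). This type alias is illustrative.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two temporal segments.

    Temporal IoU is a standard way to score temporal localization; whether
    the paper uses it as its RL reward is an assumption of this sketch.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def keep_for_rl(sample_ious: List[float], low: float = 0.2, high: float = 0.8) -> bool:
    """Hypothetical difficulty filter for RL training data.

    Keeps a sample only if the mean IoU of several sampled predictions falls
    inside [low, high], i.e. the sample is neither trivially easy nor
    hopelessly hard. The thresholds are placeholders, not values from the paper.
    """
    if not sample_ious:
        return False
    mean_iou = sum(sample_ious) / len(sample_ious)
    return low <= mean_iou <= high
```

For example, `temporal_iou((0.0, 10.0), (5.0, 15.0))` overlaps for 5 seconds out of a 15-second union, giving one third, while a sample on which every sampled prediction already scores 1.0 would be dropped by `keep_for_rl` as too easy under these placeholder thresholds.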

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Trajectory Prediction
Computer Vision > Processing > Video Understanding
Reinforcement Learning > Methods > Deep RL
Reinforcement Learning > Methods > Policy Learning

Keywords

reinforcement learning; video temporal grounding; vision-language model; supervised fine-tuning; large vision-language model; temporal localization; difficulty-controlled reinforcement learning
