ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning

Rui Wang; Bohao Li; Xiyang Dai; Jianwei Yang; Yi-Ling Chen; Zhen Xing; Yifan Yang; Dongdong Chen; Xipeng Qiu; Zuxuan Wu; Yu-Gang Jiang

EMNLP 2025

Abstract

Video understanding is essential for multimodal large language models (MLLMs) to interact effectively with users and the real world. However, analyzing long videos remains a major challenge due to the lack of high-quality video instruction data and effective training strategies. In this paper, we introduce a simple yet effective baseline for long-context video understanding, including dataset construction and training recipes. We curate a large-scale video instruction dataset with over 1M samples, encompassing videos from a few seconds to several minutes across diverse sources, without any human annotations. Additionally, we propose a progressive video instruction tuning strategy that incrementally increases input context length, enabling better utilization of videos of varying durations. Comprehensive experiments demonstrate that our dataset significantly outperforms existing video instruction datasets for fine-tuning MLLMs. Furthermore, our training approach establishes a strong video MLLM baseline, surpassing previous open-source models on video benchmarks and outperforming proprietary models like GPT-4V and GPT-4o-mini on VideoMME, even with a compact 7B model.
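The abstract does not give the details of the progressive tuning strategy, so the following is only a rough sketch of one possible reading: videos are grouped into stages by duration, and each later stage allows a larger frame budget (and therefore a longer input context). The `Stage` class, the stage names, the frame budgets, the duration bounds, and the roughly one-frame-per-second cap are all hypothetical values chosen for illustration, not settings taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_frames: int        # hypothetical frame budget per video in this stage
    max_duration_s: float  # hypothetical upper bound on video length for this stage


# Hypothetical schedule: every number here is illustrative, not from the paper.
STAGES = [
    Stage("short", max_frames=32, max_duration_s=30.0),
    Stage("medium", max_frames=64, max_duration_s=120.0),
    Stage("long", max_frames=128, max_duration_s=600.0),
]


def assign_stage(duration_s, stages=STAGES):
    """Return the first stage whose duration bound covers the video."""
    for stage in stages:
        if duration_s <= stage.max_duration_s:
            return stage
    return stages[-1]  # videos beyond the last bound fall into the final stage


def sample_frame_times(duration_s, max_frames):
    """Uniformly spaced timestamps in seconds, capped at max_frames and ~1 frame/s."""
    n = max(1, min(max_frames, int(duration_s)))
    step = duration_s / n
    return [step * (i + 0.5) for i in range(n)]


def build_curriculum(durations):
    """Group sample indices by stage, listed in order of increasing context length."""
    plan = {stage.name: [] for stage in STAGES}
    for idx, duration_s in enumerate(durations):
        plan[assign_stage(duration_s).name].append(idx)
    return plan
```

Under these assumptions, training would visit the `short`, `medium`, and `long` groups in that order, calling `sample_frame_times` with each stage's `max_frames` so that the number of visual tokens per sample grows from stage to stage.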

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Learning Paradigms > Transfer Learning; Natural Language Processing > Resources & Methods > Large Language Models; Deep Learning > Models > Large Language Models; Computer Vision > Analysis > Video Understanding; Deep Learning > Learning Types > Fine-Tuning; Computer Vision > Applications > Video Understanding

Keywords

multimodal learning; video understanding; instruction tuning; multimodal large language model; progressive training; long-context modeling; long-context video; video instruction dataset; video instruction
