VC4VG: Optimizing Video Captions for Text-to-Video Generation

Yang Du; Zhuoran Lin; Kaiqiang Song; Biao Wang; Zhicheng Zheng; Tiezheng Ge; Bo Zheng; Qin Jin

EMNLP 2025

Abstract

Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code (https://github.com/qyr0403/VC4VG) to support further research.

Topics

Computer Vision > Generation > Image Captioning
Computer Vision > Generation > Video Generation
Deep Learning > Learning Types > Multi-Modal Learning
Natural Language Processing > Applications > Image Captioning

Keywords

benchmark evaluation; video captioning; multimodal learning; video-text alignment; text-to-video generation; caption optimization
