CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning

Mohammad Javad Pirhadi; Motahhare Mirzaei; Sauleh Eetemadi

COLING 2025

Abstract

The dense video captioning task aims to detect all events occurring in a video and describe each event in natural language. Unlike most other video processing tasks, which typically assume that a video contains only a single main event, this task deals with long, untrimmed videos. Consequently, processing speed is a critical aspect of a dense video captioning system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which appear to be effective. Compared to its RGB-based counterpart, the encoder in our proposed method achieves approximately 5.4× higher speed and 5.1× lower GPU memory usage during training, and 4.7× higher speed and 7.8× lower GPU memory usage during inference. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5.

Topics

Deep Learning > Architectures > Transformers; Computer Vision > Generation > Video Generation; Computer Vision > Processing > Video Processing

Keywords

event detection, video description, dense video captioning, video encoder, compressed video, multilingual transcript, video event detection, compressed video encoder
