Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz; Hanoona Rasheed; Salman Khan; Fahad Khan

2024 ACL ACL 2024

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Abstract

AbstractConversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts for image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT acquired via manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze the strengths and weaknesses of video-based dialogue models. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — video instruction

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Muhammad Maaz , Hanoona Rasheed , Salman Khan , Fahad Khan

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Generation > Dialogue Systems Computer Vision > Analysis > Video Understanding Deep Learning > Learning Types > Multi-Modal Learning Artificial Intelligence > Core AI > Multi-Modal Learning Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning video understanding vision language model visual encoder visual language model large language model conversation system video instruction video conversation

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024