CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Gowthami Somepalli; Arkabandhu Chowdhury; Ronen Basri; Jonas Geiping; Tom Goldstein; David Jacobs

NeurIPS 2024

Abstract

The recent emergence of powerful vision-language models (VLMs) has significantly improved image captioning, and some of these models have been extended to caption videos as well. However, their ability to understand complex scenes is limited, and the descriptions they produce tend to be overly verbose and focused on the superficial appearance of objects. Unlike general-purpose video captioning, scene description, especially in movies, requires deeper contextual understanding. To address this challenge, we propose CALVIN, a specialized video LLM that leverages previous movie context to generate fully "contextual" scene descriptions. To achieve this, we train our model on a suite of tasks that integrate image-based question answering and video captioning within a unified framework, and then apply instruction tuning to refine the model's ability to provide scene captions. Lastly, we observe that our model responds well to prompt engineering and few-shot in-context learning, enabling the user to adapt it to any new movie with very little additional annotation.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Computer Vision > Generation > Video Generation
Computer Vision > Processing > Video Understanding
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Summarization
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

few-shot learning, scene understanding, video captioning, in-context learning, instruction tuning, vision-language model, contextual understanding, scene description, contextual video understanding
