End-to-End Generative Pretraining for Multimodal Video Captioning

Paul Hongsuck Seo; Arsha Nagrani; Anurag Arnab; Cordelia Schmid

CVPR 2022

Abstract

Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as generative and discriminative VideoQA, video retrieval and action classification.
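The bidirectional generation objective can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the `Segment` and `Example` containers and the `make_bidirectional_examples` helper are hypothetical names introduced here, frames are represented by a placeholder string rather than raw pixels, and "future observations" is assumed to mean the next clip's frames together with its utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One clip of an unlabelled video with its transcribed speech (hypothetical container)."""
    frames: str     # stand-in for raw pixels, e.g. a clip identifier
    utterance: str  # transcribed speech aligned with the clip


@dataclass(frozen=True)
class Example:
    """One encoder-decoder training example (hypothetical container)."""
    direction: str     # "forward" or "backward"
    frames: str        # visual input to the encoder
    context_text: str  # text input to the encoder
    target_text: str   # text the decoder should generate


def make_bidirectional_examples(segments: List[Segment]) -> List[Example]:
    """Pair each segment with its successor and emit both generation directions."""
    examples: List[Example] = []
    for present, future in zip(segments, segments[1:]):
        # Forward: present frames + present utterance -> future utterance.
        examples.append(
            Example("forward", present.frames, present.utterance, future.utterance)
        )
        # Backward (assumed form): future frames + future utterance -> present utterance.
        examples.append(
            Example("backward", future.frames, future.utterance, present.utterance)
        )
    return examples
```

Under these assumptions, a video with N consecutive segments yields 2 × (N − 1) training examples, and neither direction needs a manually written caption because the target is always a transcribed utterance.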

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Text Generation
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

video captioning, multimodal learning, video understanding, video language model, encoder-decoder model, bidirectional generation, generative pretraining, multimodal encoder, multimodal video captioning
