AutoAD III: The Prequel - Back to the Pixels

Tengda Han; Max Bain; Arsha Nagrani; Gül Varol; Weidi Xie; Andrew Zisserman

2024 CVPR CVPR 2024

AutoAD III: The Prequel - Back to the Pixels

Abstract

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently visual language models for AD generation are limited by a lack of suitable training data and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well matched to human performance. Taken together we improve the state of the art on AD generation.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tengda Han , Max Bain , Arsha Nagrani , Gül Varol , Weidi Xie , Andrew Zisserman

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Generation > Text Generation

Keywords

video captioning multimodal learning video understanding fine-grained understanding audio description generation

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024