Retrieval-Augmented Egocentric Video Captioning

Jilan Xu; Yifei Huang; Junlin Hou; Guo Chen; Yuejie Zhang; Rui Feng; Weidi Xie

CVPR 2024

Abstract

Understanding human actions from first-person-view videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper: (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos; (2) for training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets; (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions; (4) through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. On egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
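
The idea in point (3), aligning both egocentric and exocentric video features to text features that describe similar actions, can be illustrated with a generic multi-positive InfoNCE-style objective. The sketch below is an assumption-laden illustration, not the paper's EgoExoNCE formulation: the function names, the `positive_mask` input (1 where a text describes a similar action to a video), and the temperature value are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def multi_positive_nce(video_feats, text_feats, positive_mask, temperature=0.07):
    """Generic multi-positive InfoNCE loss (illustrative, not the paper's exact loss).

    video_feats:   (N, D) features of a mixed batch of ego and exo clips.
    text_feats:    (N, D) features of the narrations paired with those clips.
    positive_mask: (N, N) array, 1 where text j describes a similar action
                   to video i (assumed to be given), 0 otherwise.
    """
    v = l2_normalize(video_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature
    # Subtract the row-wise maximum for numerical stability; the ratio below is unchanged.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    pos = (exp_logits * positive_mask).sum(axis=1)
    denom = exp_logits.sum(axis=1)
    return float(-np.mean(np.log(pos / denom)))


# Toy batch: rows 0-1 stand in for ego clips, rows 2-3 for exo clips.
# Two actions "a" and "b"; ego clip 0 and exo clip 2 both show action "a".
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
text = np.stack([a, b, a, b])
mask = np.array([[1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1]], dtype=float)

aligned = multi_positive_nce(np.stack([a, b, a, b]), text, mask)
shuffled = multi_positive_nce(np.stack([b, a, b, a]), text, mask)
print(aligned, shuffled)
```

Because an ego clip and an exo clip of the same action share the same positive texts, minimizing such a loss moves both views toward a common text anchor, which is the intuition the abstract describes. In the toy batch, the loss should be lower when video features match their action texts than when they are swapped.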

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Techniques > Pretraining; Computer Vision > Generation > Image Captioning; Computer Vision > Generation > Video Generation; Computer Vision > Processing > Video Understanding; Computer Vision > Domain-Specific > Egocentric Vision; Machine Learning > Learning Types > Retrieval-Augmented Generation; Natural Language Processing > Generation > Retrieval-Augmented Generation; Deep Learning > Learning Types > Multi-Modal Learning; Deep Learning > Learning Types > Retrieval-Augmented Generation

Keywords

video captioning; multimodal learning; retrieval-augmented generation; egocentric video; text feature; retrieval-augmented learning; cross-view retrieval
