VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu; Gargi Ghosh; Po-Yao Huang; Dmytro Okhonko; Armen Aghajanyan; Florian Metze; Luke Zettlemoyer; Christoph Feichtenhofer

EMNLP 2021

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
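The abstract does not spell out the loss, so the following is only a minimal sketch of what contrasting paired video and text embeddings can look like, assuming a symmetric InfoNCE-style objective. The function names, the `temperature` parameter, and the toy embeddings are illustrative assumptions, not the authors' implementation. The sketch also assumes the batch has already been assembled, so that row `i` of each list is a temporally overlapping positive pair and the other rows play the role of negatives (in VideoCLIP these would come from nearest neighbor retrieval, which is not shown here).

```python
import math


def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Average of video-to-text and text-to-video InfoNCE losses.

    Assumption: video_emb[i] and text_emb[i] form a positive pair, and
    every other pairing in the batch is treated as a negative.
    """
    n = len(video_emb)
    # Pairwise similarity matrix: rows are videos, columns are texts.
    sim = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    # Video-to-text: each video should score its own text highest.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy 2-d embeddings (hypothetical values for illustration only).
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_info_nce(aligned, aligned)
loss_shuffled = symmetric_info_nce(aligned, shuffled)
```

In this toy setup the loss is lower when each video embedding matches its own text embedding than when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.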

Topics

Machine Learning > Learning Types > Contrastive Learning
Computer Vision > Processing > Video Understanding

Keywords

contrastive learning, zero-shot learning, multimodal learning, video-text understanding
