Video-Text Pre-training with Learned Regions for Retrieval

Rui Yan; Mike Zheng Shou; Yixiao Ge; Jinpeng Wang; Xudong Lin; Guanyu Cai; Jinhui Tang

2023 AAAI AAAI 2023

Video-Text Pre-training with Learned Regions for Retrieval

Abstract

Abstract Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatio-temporal structure of objects in video, which yet has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video-text representation learning, namely RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes continuous visual features via clustering patch-features into the same cluster according to content similarity, then (2) generates learnable masks to aggregate fragmentary features into regions with complete semantics, and finally (3) models the spatio-temporal dependencies between different semantic regions. In contrast to using off-the-shelf object detectors, our proposed module does not require explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our RegionLearner.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — video-text pre-training

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Rui Yan , Mike Zheng Shou , Yixiao Ge , Jinpeng Wang , Xudong Lin , Guanyu Cai , Jinhui Tang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Techniques > Pretraining Natural Language Processing > Applications > Information Retrieval Computer Vision > Core AI > Multimodal Learning Machine Learning > Learning Types > Multi-Modal Learning Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Techniques > Transfer Learning Artificial Intelligence > Core AI > Information Retrieval

Keywords

representation learning contrastive learning transfer learning feature quantization multimodal learning video retrieval video-text retrieval semantic region video-text pre-training region learning

Download PDF

Related papers

A Model-Agnostic Heuristics for Selective Classification 2023

Tackling Safe and Efficient Multi-Agent Reinforcement Learning via Dynamic Shielding (Student Abstract) 2023

Head-Free Lightweight Semantic Segmentation with Linear Transformer 2023

Hierarchical ConViT with Attention-Based Relational Reasoner for Visual Analogical Reasoning 2023

Deep Spiking Neural Networks with High Representation Similarity Model Visual Pathways of Macaque and Mouse 2023