Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval

Mingchao Li; Xiaoming Shi; Haitao Leng; Wei Zhou; Hai-Tao Zheng; Kuncai Zhang

AAAI 2023

Abstract

Video-language pre-training is vitally important for text-based video retrieval. Previous pre-training methods suffer from semantic misalignment because they focus on aligning critical tokens while ignoring sequence-level alignment. To alleviate this problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), which learns semantic alignments at the sequence level. Specifically, global modality reconstruction and cross-modal self-contrasting are used to better learn alignments at the sequence level. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.
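To make the idea of sequence-level alignment concrete, the sketch below pools token embeddings into one vector per sequence and scores matched video-text pairs against the other pairs in a batch with a generic symmetric contrastive (InfoNCE-style) loss. This is an illustration only and should not be read as FEEL's objective: the abstract does not specify how cross-modal self-contrasting or global modality reconstruction are computed, so mean pooling, the temperature value, and the function names `mean_pool` and `symmetric_info_nce` are all assumptions introduced here.

```python
import math


def mean_pool(token_embeddings):
    """Collapse a sequence of token vectors into one sequence-level vector.

    Assumption: simple averaging stands in for whatever pooling a real
    model would use.
    """
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(video_vecs, text_vecs, temperature=0.07):
    """Generic symmetric contrastive loss over sequence-level embeddings.

    video_vecs[i] and text_vecs[i] are assumed to be a matched pair; every
    other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_vecs]
    t = [l2_normalize(x) for x in text_vecs]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    video_to_text = _cross_entropy_on_diagonal(logits)
    text_to_video = _cross_entropy_on_diagonal(transposed)
    return 0.5 * (video_to_text + text_to_video)


# Toy batch: two videos (two frame tokens each) and two captions.
videos = [mean_pool([[1.0, 0.0], [1.0, 0.2]]), mean_pool([[0.0, 1.0], [0.1, 1.0]])]
texts = [mean_pool([[0.9, 0.0], [1.0, 0.1]]), mean_pool([[0.0, 0.8], [0.0, 1.0]])]
aligned_loss = symmetric_info_nce(videos, texts)
shuffled_loss = symmetric_info_nce(videos, list(reversed(texts)))
```

In this toy batch, `aligned_loss` comes out far smaller than `shuffled_loss`, which is the behavior a sequence-level alignment objective is meant to reward.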

Authors

Mingchao Li, Xiaoming Shi, Haitao Leng, Wei Zhou, Hai-Tao Zheng, Kuncai Zhang

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Generation > Video Generation
Natural Language Processing > Applications > Information Retrieval
Deep Learning > Learning Types > Contrastive Learning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

representation learning, cross-modal learning, video retrieval, semantic alignment, video-language pre-training, text-based video retrieval, modality reconstruction
