Tencent-MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Video Similarity Evaluation

Zhaoyang Zeng; Yongsheng Luo; Zhenhua Liu; Fengyun Rao; Dian Li; Weidong Guo; Zhen Wen

CVPR 2022

Abstract

Multi-modal video similarity evaluation is important for video recommendation tasks such as video de-duplication, relevance matching, ranking, and diversity control. However, a benchmark dataset that supports supervised training and accurate evaluation is still lacking. In this paper, we propose the Tencent-MVSE dataset, the first benchmark dataset for the multi-modal video similarity evaluation task. The Tencent-MVSE dataset contains similarity annotations for video pairs and diverse metadata, including Chinese titles, automatic speech recognition (ASR) text, and human-annotated categories and tags. We provide a simple baseline with a multi-modal Transformer architecture to perform supervised multi-modal video similarity evaluation. We also explore pre-training strategies that make use of the unpaired data. The whole dataset and our baseline will be released to promote the development of multi-modal video similarity evaluation. The dataset is available at https://tencent-mvse.github.io/.
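
To make the task concrete, the following is a minimal sketch of scoring one video pair from per-modality feature vectors. It is not the baseline described in the abstract, which uses a multi-modal Transformer. Mean fusion, cosine similarity, the modality names, and all function names here are illustrative assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def fuse_modalities(features: Dict[str, Vector]) -> Vector:
    """Average per-modality vectors into one video embedding.

    Mean fusion is an assumption made for illustration only; the paper's
    baseline fuses modalities with a multi-modal Transformer instead.
    """
    vectors = list(features.values())
    if not vectors:
        raise ValueError("at least one modality is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all modality vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def video_pair_similarity(video_a: Dict[str, Vector],
                          video_b: Dict[str, Vector]) -> float:
    """Score a video pair by comparing the fused embeddings."""
    return cosine_similarity(fuse_modalities(video_a), fuse_modalities(video_b))


# Toy vectors standing in for title and ASR text features (hypothetical).
video_a = {"title": [1.0, 0.0], "asr": [0.0, 1.0]}
video_b = {"title": [1.0, 0.0], "asr": [1.0, 0.0]}
score = video_pair_similarity(video_a, video_b)
print(round(score, 4))
```

In a supervised setting, scores like this one would be compared against the human similarity annotations for the same pairs. The choice of comparison metric is left open here because this page does not state it.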

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Semi-Supervised Learning
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Architectures > Transformers
Deep Learning > Models > Multi-Modal Learning
Data Science & Analytics > Applications > Recommender Systems

Keywords

transformer architecture; information retrieval; multimodal learning; supervised learning; multi-modal learning; video recommendation; video similarity; pre-training strategies; similarity evaluation
