Multimodal High-order Relation Transformer for Scene Boundary Detection

Xi Wei; Zhangxiang Shi; Tianzhu Zhang; Xiaoyuan Yu; Lei Xiao

ICCV 2023

Abstract

Scene boundary detection breaks long videos down into meaningful story-telling units and plays a crucial role in high-level video understanding. Despite significant advances in this area, the task remains challenging because it requires a comprehensive understanding of multimodal cues and high-level semantics. To tackle this issue, we propose a multimodal high-order relation transformer, which integrates a high-order encoder and an adaptive decoder in a unified framework. By modeling the multimodal cues and exploring similarities between shots, the encoder captures high-order relations between shots and extracts shot features with contextual semantics. By clustering the shots adaptively, the decoder can discover more universal switch patterns between successive scenes, which helps scene boundary detection. Extensive experimental results on three standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art video scene detection methods.
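The abstract gives no implementation details, so the following is only a toy illustration of the general idea of "exploring similarities between the shots". It is not the proposed transformer: it swaps in a plain adjacent-shot cosine-similarity threshold, and the function names, the feature vectors, and the threshold of 0.5 are all assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_matrix(shots):
    """Pairwise shot-to-shot similarities (hypothetical helper)."""
    return [[cosine_similarity(a, b) for b in shots] for a in shots]


def detect_boundaries(shots, threshold=0.5):
    """Return every index i where a boundary is placed between shot i and shot i+1.

    Naive baseline: a boundary is declared wherever two consecutive shots
    are less similar than `threshold` (an assumed value, not from the paper).
    """
    sim = similarity_matrix(shots)
    return [i for i in range(len(shots) - 1) if sim[i][i + 1] < threshold]


# Made-up 3-dimensional shot features: two shots from one scene,
# followed by two shots from a visually different scene.
shots = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
boundaries = detect_boundaries(shots)
print(boundaries)  # [1]: one boundary, between the second and third shots
```

A fixed threshold on adjacent shots ignores both longer-range context and other modalities, which is the kind of limitation the high-order encoder and adaptive decoder described above are meant to address.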

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Action Recognition
Computer Vision > Processing > Video Understanding

Keywords

multimodal learning, video understanding, high-order relation, scene boundary detection
