CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer

Ran Ran; Jiwei Wei; Xiangyi Cai; Xiang Guan; Jie Zou; Yang Yang; Heng Tao Shen

2025 AAAI AAAI 2025

CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer

Abstract

Abstract Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural language queries. However, existing VMR methods that employ various strategies for cross-modal alignment still face challenges such as limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose a novel Concept Decomposition Transformer (CDTR) model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations, reflecting the relevance between various concepts and capturing fine-grained semantics which is crucial for cross-modal matching. These decomposed concept representations are then used as pseudo-labels, determined as positive or negative samples by adaptive concept-specific thresholds. Subsequently, fine-grained concept alignment is performed in video intra-modal and textual-visual cross-modal, aligning different conceptual components within features, enhancing the model's ability to distinguish fine-grained semantics, and alleviating issues related to semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of the CDTR, outperforming state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ran Ran , Jiwei Wei , Xiangyi Cai , Xiang Guan , Jie Zou , Yang Yang , Heng Tao Shen

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Computer Vision > Processing > Video Understanding Natural Language Processing > Applications > Natural Language Inference

Keywords

semantic alignment cross-modal alignment semantic concept concept decomposition cross-modal matching video moment retrieval fine-grained semantics

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025