Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Omkar Thawakar; Dmitry Demidov; Ritesh Thawkar; Rao Muhammad Anwer; Mubarak Shah; Fahad Shahbaz Khan; Salman Khan

2025 ICCV ICCV 2025

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Abstract

Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content.The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model will be publicly released.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🧭 Keyword Pioneer — dense modification

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Omkar Thawakar , Dmitry Demidov , Ritesh Thawkar , Rao Muhammad Anwer , Mubarak Shah , Fahad Shahbaz Khan , Salman Khan

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Application Areas > Domain Adaptation Deep Learning > Architectures > Transformers

Keywords

multimodal learning video retrieval text encoder dense modification

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025