VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Syed Talal Wasim; Muzammal Naseer; Salman Khan; Ming-Hsuan Yang; Fahad Shahbaz Khan

2024 CVPR CVPR 2024

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Abstract

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83 accuracy demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Syed Talal Wasim , Muzammal Naseer , Salman Khan , Ming-Hsuan Yang , Fahad Shahbaz Khan

Topics

Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Analysis > Scene Understanding Computer Vision > Processing > Video Understanding Computer Vision > Analysis > Video Understanding Natural Language Processing > Understanding > Natural Language Inference Artificial Intelligence > Core AI > Computer Vision Computer Vision > Core AI > Computer Vision Deep Learning > Learning Types > Multi-Modal Learning Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

zero-shot learning visual grounding open-vocabulary recognition natural language spatio-temporal localization vision-language model video grounding text-to-video retrieval

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024