VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking

Zekun Qian; Ruize Han; Junhui Hou; Linqi Song; Wei Feng

ICCV 2025

Abstract

Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, not fully leveraging the video information. In this work, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video analysis standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object tracking (association). Experimental results underscore that VOVTrack establishes itself as a state-of-the-art solution for the open-vocabulary tracking task.
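
The abstract does not say how learned object similarities are turned into track assignments, so the following is only a generic sketch of similarity-based association and not the VOVTrack procedure. It assumes each tracked object and each new detection already has an appearance feature vector, and it matches them greedily by cosine similarity. The function names, the greedy strategy, and the `min_sim` threshold are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no appearance information
    return dot / (norm_a * norm_b)


def associate(track_feats, det_feats, min_sim=0.5):
    """Greedily match tracks to detections, most similar pairs first.

    min_sim is an assumed threshold; pairs below it are never matched.
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    candidates = []
    for t, track_feat in enumerate(track_feats):
        for d, det_feat in enumerate(det_feats):
            sim = cosine_similarity(track_feat, det_feat)
            if sim >= min_sim:
                candidates.append((sim, t, d))

    # Visit candidate pairs from most to least similar.
    candidates.sort(reverse=True)

    matches, used_tracks, used_dets = [], set(), set()
    for _sim, t, d in candidates:
        if t in used_tracks or d in used_dets:
            continue  # each track and each detection is matched at most once
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)
    return sorted(matches)


if __name__ == "__main__":
    tracks = [[1.0, 0.0], [0.0, 1.0]]
    detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
    print(associate(tracks, detections))
```

In this sketch, any detection left unmatched would typically start a new track, and a globally optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop.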

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Analysis > Object Tracking; Machine Learning > Learning Types > Multi-Modal Learning; Artificial Intelligence > Core AI > Computer Vision

Keywords

object detection; open-vocabulary detection; self-supervised learning; video understanding; video analysis; multi-object tracking; prompt-guided attention; object association
