Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Wentao Bao; Kai Li; Yuxiao Chen; Deep A Patel; Martin Renqiang Min; Yu Kong

WACV 2025

Abstract

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting, where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world, where test videos inevitably contain actions beyond the trained categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem, which aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method, OpenMixer, that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, OpenMixer is built from spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms the baselines in detecting both seen and unseen actions. We release the code, models, and dataset splits at: https://github.com/Cogito2012/OpenMixer.
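
The abstract does not spell out how unseen action categories are scored. As a rough illustration only, the sketch below shows a pattern commonly used with CLIP-style VLMs: a visual feature for a detected person is compared, by cosine similarity, against text embeddings of the action names, so new categories can be added at test time just by adding names. This is not taken from the paper and may differ from how OpenMixer's DFA module works. The function names, the toy 3-dimensional embeddings, and the temperature of 0.07 are all assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def open_vocab_scores(query_feat, text_embeds, temperature=0.07):
    """Score one visual query feature against any number of action-name embeddings.

    Hypothetical helper: the vocabulary is whatever list of text embeddings
    is passed in, so it can contain categories never seen during training.
    """
    q = l2_normalize(query_feat)
    logits = []
    for t in text_embeds:
        t = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(q, t))
        logits.append(cosine / temperature)
    return softmax(logits)

# Toy data (made up): one embedding per action name, one visual query feature.
action_names = ["run", "jump", "wave"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_feat = [0.1, 0.9, 0.2]

probs = open_vocab_scores(query_feat, text_embeds)
predicted = action_names[max(range(len(probs)), key=probs.__getitem__)]
print(predicted, [round(p, 3) for p in probs])
```

In this toy setup, extending the vocabulary only requires appending another text embedding to `text_embeds`; nothing in the scoring function is tied to a fixed number of classes.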

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Classification; Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning; spatial localization; vision-language model; temporal localization; action detection; open vocabulary detection
