Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Jeongseok Hyun; Su Ho Han; Hyolim Kang; Joon-Young Lee; Seon Joo Kim

WACV 2025

Abstract

The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.
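The two-stage procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the released code for that); it only shows the general teacher/pseudo-label/retrain loop. Everything here is an assumption made for illustration: videos are reduced to per-frame "actionness" scores, the localizer is a learned threshold, pseudo-labels are filtered by a hypothetical `min_conf` cutoff, and stage 2 trains on the labeled and pseudo-labeled videos together.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (start_frame, end_frame_exclusive, confidence); no class label, i.e. class-agnostic
Segment = Tuple[int, int, float]
Localizer = Callable[[Sequence[float]], List[Segment]]


def scores_to_segments(scores: Sequence[float], threshold: float) -> List[Segment]:
    """Group consecutive frames with score >= threshold into segments."""
    segments: List[Segment] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start, i, sum(scores[start:i]) / (i - start)))
            start = None
    if start is not None:  # segment runs to the end of the video
        n = len(scores)
        segments.append((start, n, sum(scores[start:n]) / (n - start)))
    return segments


def train_threshold_localizer(
    videos: Dict[str, List[float]], annots: Dict[str, List[Segment]]
) -> Localizer:
    """Toy stand-in for training: pick the threshold halfway between the
    mean score inside and outside the annotated segments."""
    inside: List[float] = []
    outside: List[float] = []
    for vid, scores in videos.items():
        mask = [False] * len(scores)
        for start, end, _ in annots.get(vid, []):
            for i in range(start, end):
                mask[i] = True
        for s, m in zip(scores, mask):
            (inside if m else outside).append(s)
    threshold = (sum(inside) / len(inside) + sum(outside) / len(outside)) / 2
    return lambda scores: scores_to_segments(scores, threshold)


def generate_pseudo_labels(
    localizer: Localizer, unlabeled: Dict[str, List[float]], min_conf: float
) -> Dict[str, List[Segment]]:
    """Run the localizer on unlabeled videos and keep confident segments only."""
    pseudo: Dict[str, List[Segment]] = {}
    for vid, scores in unlabeled.items():
        kept = [seg for seg in localizer(scores) if seg[2] >= min_conf]
        if kept:
            pseudo[vid] = kept
    return pseudo


def self_train(train_fn, labeled_videos, labeled_annots, unlabeled_videos, min_conf=0.7):
    # Stage 1: train on human-labeled data, then pseudo-label the unlabeled videos.
    teacher = train_fn(labeled_videos, labeled_annots)
    pseudo = generate_pseudo_labels(teacher, unlabeled_videos, min_conf)
    # Stage 2: retrain on the enlarged set (mixing in labeled data is an assumption).
    videos = {**labeled_videos, **{v: unlabeled_videos[v] for v in pseudo}}
    annots = {**labeled_annots, **pseudo}
    student = train_fn(videos, annots)
    return student, pseudo


if __name__ == "__main__":
    labeled = {"a": [0.1, 0.2, 0.9, 0.8, 0.1]}
    labels = {"a": [(2, 4, 1.0)]}
    unlabeled = {"u1": [0.0, 0.9, 0.95, 0.1], "u2": [0.1, 0.55, 0.6, 0.2]}
    student, pseudo = self_train(train_threshold_localizer, labeled, labels, unlabeled)
    print(pseudo)
```

In this toy setup, the confidence cutoff is what keeps low-scoring detections from the stage-1 localizer out of the stage-2 training set; how the actual method selects or weights pseudo-labels is not stated in the abstract.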

Topics

Machine Learning > Learning Types > Self-Supervised Learning; Machine Learning > Learning Types > Unsupervised Learning

Keywords

vision-language model; temporal action localization; action localizer
