WACV 2025

Boosting Semi-Supervised Video Action Detection with Temporal Context

Abstract

This paper studies semi-supervised learning for video action detection (VAD), which assumes that only a small portion of the training videos is labeled while the rest remain unlabeled. Existing semi-supervised methods for VAD mainly exploit the spatial context of unlabeled videos and leave their temporal context largely unexplored. To address this, we present a novel semi-supervised learning framework that effectively incorporates spatio-temporal context during training. We first introduce a new augmentation strategy, called temporal cross-view augmentation, to learn representations that are robust across clips depicting the same action but not aligned on the time axis. We also propose a new context fusion method, called global-local context fusion, that enhances the features of each frame by incorporating those of the other frames within a clip; actively leveraging the spatio-temporal context of videos in this way leads to significant performance improvement. Our framework was evaluated on UCF101-24 and JHMDB-21, where it outperformed all existing methods in every evaluation setting.
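
The abstract does not say how the two temporally misaligned clips are drawn. As a rough illustration only, the sketch below assumes that both views are fixed-length clips taken from the same video at different start frames. The function name `sample_temporal_cross_views` and the parameters `clip_len` and `max_shift` are hypothetical and do not come from the paper.

```python
import random


def sample_temporal_cross_views(num_frames, clip_len, max_shift, rng=None):
    """Pick frame indices for two clips of one video that are shifted in time.

    Illustrative assumption, not the paper's procedure: the second clip starts
    within `max_shift` frames of the first, at a different start frame
    whenever the video is long enough to allow it.
    """
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    rng = rng or random.Random()

    last_start = num_frames - clip_len          # latest valid start frame
    start_a = rng.randint(0, last_start)        # inclusive on both ends

    # Candidate starts for the second view: near the first, but not identical.
    lo = max(0, start_a - max_shift)
    hi = min(last_start, start_a + max_shift)
    candidates = [s for s in range(lo, hi + 1) if s != start_a]
    start_b = rng.choice(candidates) if candidates else start_a

    view_a = list(range(start_a, start_a + clip_len))
    view_b = list(range(start_b, start_b + clip_len))
    return view_a, view_b


if __name__ == "__main__":
    a, b = sample_temporal_cross_views(num_frames=40, clip_len=8, max_shift=4)
    print("view A frames:", a)
    print("view B frames:", b)
```

In a semi-supervised setup, the two index lists would typically be used to cut two clips from an unlabeled video so that predictions on the overlapping frames can be compared. How the paper matches and supervises the two views is not described in the abstract.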
