SV-data2vec: Guiding Video Representation Learning with Latent Skeleton Targets

Zorana Doždor; Tomislav Hrkac; Zoran Kalafatic

WACV 2025

Abstract

Recent advancements in action recognition leverage both skeleton and video modalities to achieve state-of-the-art performance. However, due to the challenges of early fusion, which tends to underutilize the strengths of each modality, existing methods often resort to late fusion, consequently leading to more complex designs. Additionally, self-supervised learning approaches utilizing both modalities remain underexplored. In this paper, we introduce SV-data2vec, a novel self-supervised framework for learning from skeleton and video data. Our approach employs a student-teacher architecture, where the teacher network generates contextualized targets based on skeleton data. The student network performs a masked prediction task using both skeleton and visual data. Remarkably, after pretraining with both modalities, our method allows for fine-tuning with RGB data alone, achieving results on par with multimodal approaches by effectively learning video representations through skeleton data guidance. Extensive experiments on benchmark datasets NTU RGB+D 60, NTU RGB+D 120, and Toyota Smarthome confirm that our method outperforms existing RGB-based state-of-the-art techniques. The code is available at github.com/zoranadozdor/SVdata2vec.
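
The student-teacher masked prediction setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that). Everything below is an assumption made for illustration: the `encode` function stands in for a real network, fusion is done by summing skeleton and video tokens, masked tokens are zeroed, and the teacher is updated as an exponential moving average of the student, as in the original data2vec. All function and variable names are hypothetical, and no gradient step is shown.

```python
import random

def encode(weights, tokens):
    """Toy stand-in for an encoder: mix each token with the sequence mean
    (a crude form of context), then scale each feature by a weight."""
    dim = len(weights)
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[weights[d] * (tok[d] + mean[d]) for d in range(dim)] for tok in tokens]

def masked_mse(pred, target, mask):
    """Mean squared error computed over masked positions only."""
    errs = [(p - t) ** 2
            for p_tok, t_tok, m in zip(pred, target, mask) if m
            for p, t in zip(p_tok, t_tok)]
    return sum(errs) / len(errs)

def ema_update(teacher, student, tau=0.99):
    """Assumed data2vec-style update: teacher weights track the student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def pretrain_step(student_w, teacher_w, skeleton, video, mask):
    # Teacher sees the full, unmasked skeleton sequence and produces targets.
    targets = encode(teacher_w, skeleton)
    # Student sees masked skeleton tokens (zeroed here) fused with video tokens
    # by simple addition; both choices are illustrative assumptions.
    fused = [[(0.0 if m else s) + v for s, v in zip(s_tok, v_tok)]
             for s_tok, v_tok, m in zip(skeleton, video, mask)]
    preds = encode(student_w, fused)
    # The loss only scores positions the student could not see in the skeleton.
    loss = masked_mse(preds, targets, mask)
    return loss, ema_update(teacher_w, student_w)

random.seed(0)
T, D = 6, 4  # number of tokens and feature dimension (toy sizes)
skeleton = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
video = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
mask = [True, False, True, False, False, True]
student_w = [1.0] * D
teacher_w = [0.5] * D

loss, teacher_w = pretrain_step(student_w, teacher_w, skeleton, video, mask)
print("masked loss:", loss)
```

In a real pipeline the student weights would be updated by backpropagating this loss before each teacher update. For the RGB-only fine-tuning the abstract mentions, the skeleton input would be dropped and only the student's video pathway kept.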

Authors

Zorana Doždor, Tomislav Hrkac, Zoran Kalafatic

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Action Recognition
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Learning Paradigms > Self-Supervised Learning

Keywords

action recognition, self-supervised learning, multimodal learning, masked prediction, student-teacher architecture, video representation learning, skeleton data
