Weakly Supervised Action Selection Learning in Video

Junwei Ma; Satya Krishna Gorti; Maksims Volkovs; Guangwei Yu

2021 CVPR CVPR 2021

Weakly Supervised Action Selection Learning in Video

Abstract

Localizing actions in video is a core task in computer vision. The weakly supervised temporal localization problem investigates whether this task can be adequately solved with only video-level labels, significantly reducing the amount of expensive and error-prone annotation that is required. A common approach is to train a frame-level classifier where frames with the highest class probability are selected to make a video-level prediction. Frame-level activations are then used for localization. However, the absence of frame-level annotations cause the classifier to impart class bias on every frame. To address this, we propose the Action Selection Learning (ASL) approach to capture the general concept of action, a property we refer to as "actionness". Under ASL, the model is trained with a novel class-agnostic task to predict which frames will be selected by the classifier. Empirically, we show that ASL outperforms leading baselines on two popular benchmarks THUMOS-14 and ActivityNet-1.2, with 10.3% and 5.7% relative improvement respectively. We further analyze the properties of ASL and demonstrate the importance of actionness. Full code for this work is available here https://github.com/layer6ai-labs/ASL

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — weakly supervised action localization

🐣 Hot Topic Early Bird — temporal action localization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Junwei Ma , Satya Krishna Gorti , Maksims Volkovs , Guangwei Yu

Topics

Machine Learning > Learning Types > Weakly Supervised Learning Computer Vision > Analysis > Action Recognition Deep Learning > Learning Types > Weakly Supervised Learning

Keywords

weakly supervised learning video understanding action localization temporal action localization frame-level classification temporal localization video-level annotation weakly supervised action localization

Download PDF

Related papers

Learning To Reconstruct High Speed and High Dynamic Range Videos From Events 2021

DeFLOCNet: Deep Image Editing via Flexible Low-Level Controls 2021

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs 2021

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization 2021

Pose-Guided Human Animation From a Single Image in the Wild 2021