Modeling Multi-Label Action Dependencies for Temporal Action Localization

Praveen Tirupattur; Kevin Duarte; Yogesh S Rawat; Mubarak Shah

CVPR 2021

Abstract

Real-world videos contain many complex actions with inherent relationships between action classes. In this work, we propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos. As opposed to previous works, which leverage video-level co-occurrence of actions, we distinguish between the relationships among actions that occur at the same time-step and those among actions that occur at different time-steps (i.e., actions that precede or follow each other). We define these distinct relationships as action dependencies. We propose to improve action localization performance by modeling these action dependencies in a novel attention-based Multi-Label Action Dependency (MLAD) layer. The MLAD layer consists of two branches: a Co-occurrence Dependency Branch and a Temporal Dependency Branch, which model co-occurrence action dependencies and temporal action dependencies, respectively. We observe that existing metrics used for multi-label classification do not explicitly measure how well action dependencies are modeled; therefore, we propose novel metrics that consider both co-occurrence and temporal dependencies between action classes. Through empirical evaluation and extensive analysis, we show improved performance over state-of-the-art methods on multi-label action localization benchmarks (MultiTHUMOS and Charades) in terms of f-mAP and our proposed metric.
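
The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes class-specific features laid out as `feats[t][c]` (a D-dimensional vector for time-step `t` and class `c`), uses plain scaled dot-product self-attention without learned query/key/value projections, and merges the branches with a fixed weight `alpha`. The function names, the feature layout, and the merge rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: queries, keys, and values are the inputs themselves
    (no learned projections), to keep the sketch short.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(w[j] * seq[j][i] for j in range(len(seq))) for i in range(d)])
    return out


def cooccurrence_branch(feats):
    """Attend across classes within each time-step (same-time relationships)."""
    return [self_attention(step) for step in feats]


def temporal_branch(feats):
    """Attend across time-steps separately for each class (different-time relationships)."""
    T, C = len(feats), len(feats[0])
    per_class = [self_attention([feats[t][c] for t in range(T)]) for c in range(C)]
    # Transpose back from [class][time] to [time][class].
    return [[per_class[c][t] for c in range(C)] for t in range(T)]


def mlad_like_layer(feats, alpha=0.5):
    """Merge both branches with a fixed weight (hypothetical merge rule)."""
    co = cooccurrence_branch(feats)
    te = temporal_branch(feats)
    return [
        [
            [alpha * a + (1.0 - alpha) * b for a, b in zip(co_vec, te_vec)]
            for co_vec, te_vec in zip(co_t, te_t)
        ]
        for co_t, te_t in zip(co, te)
    ]


# Example: 3 time-steps, 2 classes, 2-dimensional features.
feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0]],
]
out = mlad_like_layer(feats)
print(len(out), len(out[0]), len(out[0][0]))
```

In this sketch the co-occurrence branch never mixes information across time-steps, and the temporal branch never mixes information across classes, which mirrors the distinction the abstract draws between the two kinds of action dependencies.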

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Classification
Machine Learning > Learning Types > Weakly Supervised Learning
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Video Understanding

Keywords

action recognition, attention mechanism, multi-label classification, video understanding, temporal action localization, action dependencies
