Action Recognition by Hierarchical Mid-Level Action Elements

Tian Lan; Yuke Zhu; Amir Roshan Zamir; Silvio Savarese

2015 ICCV ICCV 2015

Action Recognition by Hierarchical Mid-Level Action Elements

Abstract

Realistic videos of human actions exhibit rich spatiotemporal structures at multiple levels of granularity: an action can always be decomposed into multiple finer-grained elements in both space and time. To capture this intuition, we propose to represent videos by a hierarchy of mid-level action elements (MAEs), where each MAE corresponds to an action-related spatiotemporal segment in the video. We introduce an unsupervised method to generate this representation from videos. Our method is capable of distinguishing action-related segments from background segments and representing actions at multiple spatiotemporal resolutions. Given a set of spatiotemporal segments generated from the training data, we introduce a discriminative clustering algorithm that automatically discovers MAEs at multiple levels of granularity. We develop structured models that capture a rich set of spatial, temporal and hierarchical relations among the segments, where the action label and multiple levels of MAE labels are jointly inferred. The proposed model achieves state-of-the-art performance in multiple action recognition benchmarks. Moreover, we demonstrate the effectiveness of our model in real-world applications such as action recognition in large-scale untrimmed videos and action parsing.

🌉 Interdisciplinary Bridge — Computer Vision and Machine Learning

🧭 Keyword Pioneer — mid-level action element

🐣 Hot Topic Early Bird — hierarchical model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tian Lan , Yuke Zhu , Amir Roshan Zamir , Silvio Savarese

Topics

Machine Learning > Core Methods > Representation Learning Machine Learning > Learning Types > Unsupervised Learning Computer Vision > Analysis > Action Recognition

Keywords

action recognition discriminative clustering hierarchical model video representation mid-level action element spatiotemporal segmentation

Download PDF

Related papers

Cutting Edge: Soft Correspondences in Multimodal Scene Parsing 2015

Unsupervised Generation of a Viewpoint Annotated Car Dataset From Videos 2015

Depth-Based Hand Pose Estimation: Data, Methods, and Challenges 2015

Peeking Template Matching for Depth Extension 2015

Learning Semi-Supervised Representation Towards a Unified Optimization Framework for Semi-Supervised Learning 2015