StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Dongliang He; Zhichao Zhou; Chuang Gan; Fu Li; Xiao Liu; Yandong Li; Limin Wang; Shilei Wen

AAAI 2019

Abstract

Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution to the super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution to the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
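The two operations named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`stack_super_images`, `temporal_conv_per_channel`, `channel_mix`), the nested-list layout, the zero padding, and the order of the two convolution steps are all assumptions made for illustration. The abstract only states that N frames are stacked into a 3N-channel super-image and that the temporal Xception block separates channel-wise from temporal-wise convolution.

```python
from typing import List

Frame = List[List[List[float]]]  # one frame as [channels][height][width]
Sequence = List[List[float]]     # a feature sequence as [time][channels]


def stack_super_images(frames: List[Frame], n: int) -> List[Frame]:
    """Group every n consecutive frames into one super-image.

    The channels of the n frames are concatenated, so n RGB frames
    (3 channels each) yield a super-image with 3 * n channels.
    """
    if n <= 0 or len(frames) % n != 0:
        raise ValueError("frame count must be a positive multiple of n")
    return [
        [channel for frame in frames[start:start + n] for channel in frame]
        for start in range(0, len(frames), n)
    ]


def temporal_conv_per_channel(seq: Sequence, kernels: List[List[float]]) -> Sequence:
    """Temporal-wise step (assumed form): filter each channel along time
    with its own 1D kernel, zero-padded so the output keeps the same length."""
    steps, channels = len(seq), len(seq[0])
    out = [[0.0] * channels for _ in range(steps)]
    for c in range(channels):
        kernel = kernels[c]
        half = len(kernel) // 2
        for t in range(steps):
            for j, weight in enumerate(kernel):
                src = t + j - half
                if 0 <= src < steps:
                    out[t][c] += weight * seq[src][c]
    return out


def channel_mix(seq: Sequence, weights: List[List[float]]) -> Sequence:
    """Channel-wise step (assumed form): a kernel-size-1 convolution that
    mixes channels at each time step; weights[o][c] maps input c to output o."""
    return [[sum(w * x for w, x in zip(row, step)) for row in weights]
            for step in seq]


# Six toy RGB frames of size 2x2; frame i is filled with the value i.
frames = [[[[float(i)] * 2 for _ in range(2)] for _ in range(3)] for i in range(6)]
supers = stack_super_images(frames, n=3)  # 2 super-images, 9 channels each

# A toy 3-step, 2-channel feature sequence passed through both steps.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
filtered = temporal_conv_per_channel(features, [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])
mixed = channel_mix(filtered, [[1.0, 1.0]])
```

Splitting the temporal filtering from the channel mixing is the same idea as a depthwise-separable convolution: each step has far fewer weights than a single kernel spanning both time and channels, which fits the abstract's emphasis on the accuracy versus model-complexity trade-off.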

Authors

Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen

Topics

Deep Learning > Architectures > Neural Networks
Computer Vision > Analysis > Action Recognition
Computer Vision > Processing > Video Processing
Deep Learning > Architectures > Convolutional Neural Networks

Keywords

neural network architecture, action recognition, video understanding, convolutional neural network, spatial-temporal modeling, temporal convolution
