Spatiotemporal Multiplier Networks for Video Action Recognition

Christoph Feichtenhofer; Axel Pinz; Richard P. Wildes

CVPR 2017

Abstract

This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies, we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and is able to evaluate a video in a single forward pass. Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.
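The abstract gives no formulas, so the following is only a minimal sketch of one plausible reading of its two main ideas: a residual unit in the appearance pathway whose residual branch receives input gated elementwise by motion features, and a temporal kernel initialised as an identity mapping. All function names, the toy residual branch, and the 1D list representation are assumptions made for illustration. They are not taken from the paper's implementation.

```python
# Illustrative sketch only: names, shapes, and the toy residual branch
# are assumptions, not the authors' code.

def residual_fn(x, weight):
    """Hypothetical stand-in for a residual branch F: scale, then ReLU."""
    return [max(0.0, weight * v) for v in x]


def gated_residual_unit(appearance, motion, weight=0.5):
    """Appearance residual unit with a motion-gated residual branch.

    Computes appearance + F(appearance * motion), elementwise, so the
    skip connection carries the appearance features through unchanged.
    """
    gated = [a * m for a, m in zip(appearance, motion)]
    residual = residual_fn(gated, weight)
    return [a + r for a, r in zip(appearance, residual)]


def identity_temporal_kernel(size=3):
    """1D temporal kernel that initially passes the centre frame through."""
    kernel = [0.0] * size
    kernel[size // 2] = 1.0
    return kernel


def temporal_conv(frames, kernel):
    """Zero-padded 1D correlation over the time axis (same output length)."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < len(frames):
                acc += w * frames[s]
        out.append(acc)
    return out
```

In this sketch, a motion value of zero switches off the residual branch at that position while the skip connection still passes the appearance feature through. The identity-initialised kernel leaves a feature sequence unchanged at initialisation, so any temporal weighting would have to be learned during training.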

Topics

Computer Vision > Analysis > Action Recognition

Computer Vision > Processing > Video Understanding

Deep Learning > Techniques > Attention

Deep Learning > Architectures > Convolutional Neural Networks

Keywords

action recognition, convolutional neural network, residual network, temporal relationship, spatiotemporal feature, video action recognition, two-stream architecture, motion gating, multiplicative gating

Related papers

Deep Outdoor Illumination Estimation 2017

SRN: Side-output Residual Network for Object Symmetry Detection in the Wild 2017

Weakly Supervised Semantic Segmentation Using Web-Crawled Videos 2017

FASON: First and Second Order Information Fusion Network for Texture Recognition 2017

Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization 2017