Beyond Short Snippets: Deep Networks for Video Classification
Abstract
Convolutional neural networks (CNNs) have been exten- sively applied for image recognition problems giving state- of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image infor- mation across a video over longer time periods than previ- ously attempted. We propose two methods capable of han- dling full length videos. The first method explores various convolutional temporal feature pooling architectures, ex- amining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improve- ments over previously published results on the Sports 1 mil- lion dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.2% vs. 87.9%) and without additional optical flow information (82.6% vs. 72.8%).