Large-scale Video Classification with Convolutional Neural Networks

Andrej Karpathy; George Toderici; Sanketh Shetty; Thomas Leung; Rahul Sukthankar; Li Fei-fei

2014 CVPR CVPR 2014

Large-scale Video Classification with Convolutional Neural Networks

Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🧭 Keyword Pioneer — spatio-temporal network

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Andrej Karpathy , George Toderici , Sanketh Shetty , Thomas Leung , Rahul Sukthankar , Li Fei-fei

Topics

Deep Learning > Architectures > Neural Networks Computer Vision > Analysis > Action Recognition Computer Vision > Processing > Video Understanding Computer Vision > Analysis > Video Understanding Deep Learning > Architectures > Convolutional Neural Networks

Keywords

action recognition video classification video understanding convolutional neural network spatio-temporal network

Download PDF

Related papers

Efficient Nonlinear Markov Models for Human Motion 2014

Occlusion Geodesics for Online Multi-Object Tracking 2014

A Principled Approach for Coarse-to-Fine MAP Inference 2014

Locally Optimized Product Quantization for Approximate Nearest Neighbor Search 2014

Fast and Accurate Image Matching with Cascade Hashing for 3D Reconstruction 2014