Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Nataliya Shapovalova; Michalis Raptis; Leonid Sigal; Greg Mori

NIPS (NeurIPS) 2013

Abstract

We propose a new weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm that lets us search efficiently over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations, in the form of bounding boxes, to guide the latent model during training, we use human gaze data as a weak supervisory signal. This is achieved by incorporating gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on UCF-Sports, a challenging benchmark dataset, show that our model is more accurate in terms of classification and achieves state-of-the-art results in localization. In addition, we show how our model can produce top-down saliency maps conditioned on the classification label and the localized latent paths.
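
The search step mentioned in the abstract can be illustrated with a small sketch. It is not the generalized multi-path algorithm of the paper. It assumes only the basic single-path Max-Path idea: dynamic programming over a grid of per-frame local scores, where a path may start and end at any frame and moves to a neighboring cell between consecutive frames. The function name `max_path`, the `radius` parameter, and the toy scores are hypothetical and chosen for illustration.

```python
def max_path(scores, radius=1):
    """Find the highest-scoring spatio-temporal path (illustrative sketch).

    scores[t][y][x] is a local score for cell (y, x) in frame t.
    A path visits one cell per frame over a contiguous range of frames
    and moves at most `radius` cells in y and x between frames.
    Returns (best_total, path) with path as a list of (t, y, x).
    """
    T, H, W = len(scores), len(scores[0]), len(scores[0][0])
    best_total, best_end = float("-inf"), None
    prev = None   # accumulated scores for frame t - 1
    back = []     # back[t][y][x] = predecessor (y, x), or None at a path start

    for t in range(T):
        cur = [[0.0] * W for _ in range(H)]
        bp = [[None] * W for _ in range(H)]
        for y in range(H):
            for x in range(W):
                # 0.0 means "start a new path here" (Kadane-style reset).
                best_prev, arg = 0.0, None
                if prev is not None:
                    for ny in range(max(0, y - radius), min(H, y + radius + 1)):
                        for nx in range(max(0, x - radius), min(W, x + radius + 1)):
                            if prev[ny][nx] > best_prev:
                                best_prev, arg = prev[ny][nx], (ny, nx)
                cur[y][x] = scores[t][y][x] + best_prev
                bp[y][x] = arg
                if cur[y][x] > best_total:
                    best_total, best_end = cur[y][x], (t, y, x)
        back.append(bp)
        prev = cur

    # Follow back-pointers from the best end cell to the start of its path.
    path = []
    t, y, x = best_end
    while True:
        path.append((t, y, x))
        pred = back[t][y][x]
        if pred is None:
            break
        t, (y, x) = t - 1, pred
    path.reverse()
    return best_total, path


# Toy example: three frames of a 1 x 3 grid with a high-scoring diagonal.
toy_scores = [
    [[1.0, -1.0, -1.0]],
    [[-1.0, 2.0, -1.0]],
    [[-1.0, -1.0, 3.0]],
]
total, path = max_path(toy_scores)
```

On the toy input the recovered path follows the diagonal of positive scores. In the setting the abstract describes, the local scores would come from the learned model, and the search would range over multiple paths plus context rather than a single path.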

Topics

Machine Learning > Learning Types > Weakly Supervised Learning
Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Semantic Segmentation
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Weakly Supervised Learning

Keywords

action recognition, weakly supervised learning, video understanding, eye tracking, weakly-supervised learning, structured learning, action localization, spatio-temporal localization, human gaze, latent SVM, eye gaze tracking, video action recognition, spatio-temporal action localization, eye gaze, latent structured learning
