Objects2action: Classifying and Localizing Actions Without Any Video Example

Mihir Jain; Jan C. van Gemert; Thomas Mensink; Cees G. M. Snoek

2015 ICCV ICCV 2015

Objects2action: Classifying and Localizing Actions Without Any Video Example

Abstract

The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.

🐣 Hot Topic Early Bird — zero-shot learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Mihir Jain , Jan C. van Gemert , Thomas Mensink , Cees G. M. Snoek

Topics

Machine Learning > Core Methods > Embedding Learning Machine Learning > Learning Types > Zero-Shot Learning

Keywords

zero-shot learning action recognition skip-gram model semantic embedding word embedding

Download PDF

Related papers

Cutting Edge: Soft Correspondences in Multimodal Scene Parsing 2015

Unsupervised Generation of a Viewpoint Annotated Car Dataset From Videos 2015

Depth-Based Hand Pose Estimation: Data, Methods, and Challenges 2015

Peeking Template Matching for Depth Extension 2015

Learning Semi-Supervised Representation Towards a Unified Optimization Framework for Semi-Supervised Learning 2015