Speech2Action: Cross-Modal Supervision for Action Recognition

Arsha Nagrani; Chen Sun; David Ross; Rahul Sukthankar; Cordelia Schmid; Andrew Zisserman

CVPR 2020

Abstract

Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays both describe actions and contain the speech of characters, and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.
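The weak-labelling step described in the abstract can be illustrated with a minimal sketch. It assumes the speech classifier has already produced a score per action for each speech segment, and that only confident predictions are kept as weak labels. The function name `select_weak_labels`, the `threshold` parameter and its default value, and the score format are hypothetical choices made for illustration; the abstract does not specify how predictions are turned into labels.

```python
from typing import Dict, List, Tuple


def select_weak_labels(
    segment_scores: Dict[str, Dict[str, float]],
    threshold: float = 0.9,  # assumed cut-off, not a value taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep segments whose best-scoring action reaches the threshold.

    segment_scores maps a segment id to {action name: classifier score}.
    Returns (segment id, action, score) triples in input order.
    """
    labelled = []
    for segment_id, scores in segment_scores.items():
        if not scores:
            # No prediction available for this segment, so it gets no label.
            continue
        # Pick the action with the highest score for this segment.
        action, confidence = max(scores.items(), key=lambda item: item[1])
        if confidence >= threshold:
            labelled.append((segment_id, action, confidence))
    return labelled
```

Each kept triple would then point to the video clip aligned with that speech segment, giving a weakly labelled training example for a video model. Raising the threshold trades fewer clips for cleaner labels.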

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Learning Types > Weakly Supervised Learning; Computer Vision > Analysis > Action Recognition; Natural Language Processing > Resources & Methods > Large Language Models; Machine Learning > Learning Types > Multi-Modal Learning; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

action recognition; speech recognition; speech processing; cross-modal learning; weak supervision; video clip
