Actor and Action Video Segmentation From a Sentence

Kirill Gavrilyuk; Amir Ghodrati; Zhenyang Li; Cees G. M. Snoek

2018 CVPR CVPR 2018

Actor and Action Video Segmentation From a Sentence

Abstract

This paper strives for pixel-level segmentation of actors and their actions in video content. Different from existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state-of-the-art.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — language-guided segmentation

🐣 Hot Topic Early Bird — natural language

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Kirill Gavrilyuk , Amir Ghodrati , Zhenyang Li , Cees G. M. Snoek

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Action Recognition Computer Vision > Processing > Semantic Segmentation Computer Vision > Processing > Video Segmentation Deep Learning > Learning Types > Zero-Shot Learning

Keywords

zero-shot learning semantic segmentation video segmentation multimodal learning video understanding natural language action segmentation natural language interface language-guided segmentation fully-convolutional network actor segmentation actor action segmentation

Download PDF

Related papers

Multi-Shot Pedestrian Re-Identification via Sequential Decision Making 2018

Multi-Cue Correlation Filters for Robust Visual Tracking 2018

Pointwise Convolutional Neural Networks 2018

Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking 2018

Image Generation From Scene Graphs 2018