Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Shuang Li; Yilun Du; Antonio Torralba; Josef Sivic; Bryan Russell

ICCV 2021

Abstract

We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.
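To make the idea of contrastively associating regions with a vocabulary concrete, here is a minimal sketch. It is not the paper's loss. It assumes a generic InfoNCE-style objective in which a video is scored against each vocabulary word by the best-matching region (a multiple-instance shortcut, since the true region is unknown under weak supervision). The function names, the toy embeddings, and the temperature value are all hypothetical, and the temporal-continuity term mentioned in the abstract is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def video_word_score(regions, word_embedding):
    """Score a video against one word using its best-matching region.

    Assumption: max-pooling over regions stands in for the unknown
    region-to-word assignment under weak supervision.
    """
    return max(cosine(r, word_embedding) for r in regions)

def contrastive_loss(regions, vocab, positive, temperature=0.1):
    """Generic InfoNCE-style loss (an illustrative stand-in, not the paper's).

    regions:  list of region feature vectors from one video
    vocab:    dict mapping each word to its embedding
    positive: the word taken from the video-level label
    """
    scores = {w: video_word_score(regions, e) / temperature
              for w, e in vocab.items()}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores.values())
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return log_denom - scores[positive]

# Toy example: two region features and a two-word vocabulary.
regions = [[1.0, 0.0], [0.0, 1.0]]
vocab = {"cup": [1.0, 0.0], "dog": [-1.0, -1.0]}
loss_match = contrastive_loss(regions, vocab, positive="cup")
loss_mismatch = contrastive_loss(regions, vocab, positive="dog")
```

In this toy setup, the loss is small when the labeled word aligns with some region in the video and large when it aligns with none, which is the behavior a contrastive objective of this kind is meant to encourage.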

Authors

Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

Topics

Machine Learning > Learning Types > Contrastive Learning

Machine Learning > Learning Types > Weakly Supervised Learning

Computer Vision > Analysis > Action Recognition

Keywords

contrastive learning, weakly supervised learning, video understanding, human-object interaction, spatiotemporal region
