No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

Tanmay Gupta; Alexander Schwing; Derek Hoiem

ICCV 2019

Abstract

We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.
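To make the factorization concrete, here is a minimal sketch of how per-factor outputs might be combined into a single score for one human-object-interaction candidate. The abstract does not spell out the combination rule, so the rule used here (detection scores multiplied by a sigmoid over summed factor logits) is an assumption. The names `Candidate`, `interaction_score`, and `box_pair_layout`, and the specific box features, are also hypothetical illustrations rather than the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class Candidate:
    """One (human box, object box, interaction) candidate (hypothetical container)."""
    human_det_score: float          # detector confidence for the human box
    object_det_score: float         # detector confidence for the object box
    factor_logits: Dict[str, float] # e.g. human/object appearance, box layout, pose


def interaction_score(c: Candidate) -> float:
    # Assumption: factor logits are summed, squashed by a sigmoid,
    # and then scaled by the two detection scores.
    interaction_prob = sigmoid(sum(c.factor_logits.values()))
    return c.human_det_score * c.object_det_score * interaction_prob


def box_pair_layout(human: Box, obj: Box) -> List[float]:
    """Illustrative coarse layout encoding for a box pair (not the paper's exact features).

    Returns the object-center offset normalized by human box size,
    followed by the log width and log height ratios.
    """
    hw, hh = human[2] - human[0], human[3] - human[1]
    ow, oh = obj[2] - obj[0], obj[3] - obj[1]
    hcx, hcy = (human[0] + human[2]) / 2, (human[1] + human[3]) / 2
    ocx, ocy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    return [(ocx - hcx) / hw, (ocy - hcy) / hh, math.log(ow / hw), math.log(oh / hh)]


if __name__ == "__main__":
    cand = Candidate(0.9, 0.8, {"human_app": 0.4, "object_app": -0.1, "box": 0.7})
    print(interaction_score(cand))
    print(box_pair_layout((0, 0, 10, 20), (10, 0, 20, 20)))
```

In a sketch like this, the optional fine-grained layout factor (human pose) is simply one more entry in `factor_logits`, so it can be included or dropped without changing the scoring code.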

Topics

Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Object Detection

Keywords

pose estimation, object detection, human-object interaction, layout encoding
