MOVES: Manipulated Objects in Video Enable Segmentation

Richard E. L. Higgins; David F. Fouhey

CVPR 2023

Abstract

We present a method that uses manipulation to learn to understand the objects people hold, as well as hand-object contact. We train a system that takes a single RGB image and produces a pixel-embedding that can be used to answer grouping questions (do these two pixels go together?) as well as hand-association questions (is this hand holding that pixel?). Rather than painstakingly annotate segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given people segmentations, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held object tasks.
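The abstract does not spell out how epipolar geometry and optical flow combine into pseudo-labels. One plausible reading, sketched below, is to flag pixels whose flow disagrees with the epipolar constraint implied by camera motion, since those pixels likely belong to something moving independently (such as a held object). This is an illustrative sketch, not the authors' implementation: it assumes a fundamental matrix `F` has already been estimated and that dense flow comes from some off-the-shelf model. The function names and the `threshold` value are hypothetical.

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Sampson distance of correspondences pts1 -> pts2 (N x 2 arrays) under F."""
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])          # homogeneous points in frame 1
    x2 = np.hstack([pts2, ones])          # homogeneous points in frame 2
    Fx1 = x1 @ F.T                        # row i is F @ x1[i]
    Ftx2 = x2 @ F                         # row i is F.T @ x2[i]
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)   # guard against division by zero

def pseudo_label_moving(F, flow, threshold=1.0):
    """Boolean H x W mask: True where flow violates the epipolar constraint.

    flow is an H x W x 2 array of (dx, dy) displacements.
    threshold is a hypothetical cutoff on the Sampson error.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts1 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    pts2 = pts1 + flow.reshape(-1, 2)
    err = sampson_error(F, pts1, pts2)
    return (err > threshold).reshape(h, w)
```

Pixels marked True could then serve as positive examples when training a grouping embedding, while pixels consistent with the camera motion are treated as background. How the threshold is chosen and how noisy flow is handled are left open here.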

Topics

Computer Vision > Analysis > Object Detection
Computer Vision > Analysis > Object Tracking
Computer Vision > Processing > Video Processing
Computer Vision > Processing > Video Understanding
Computer Vision > Domain-Specific > Egocentric Vision
Computer Vision > Analysis > Video Understanding
Computer Vision > Analysis > Object Segmentation

Keywords

video segmentation, egocentric vision, video understanding, hand pose estimation, object segmentation, optical flow, hand-object interaction, epipolar geometry, pixel embedding, object grouping
