Putting the Object Back into Video Object Segmentation

Ho Kei Cheng; Seoung Wug Oh; Brian Price; Joon-Young Lee; Alexander Schwing

CVPR 2024

Abstract

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those queries, it interacts with the bottom-up pixel features iteratively using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: hkchengrex.github.io/Cutie
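The foreground-background masked attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the released Cutie code: the function name `masked_query_read`, the rule that the first half of the queries reads only foreground pixels and the second half only background pixels, the single attention head, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_query_read(queries, pixel_feats, fg_mask):
    """One hypothetical masked cross-attention step (illustrative only).

    queries:     (N, C) object queries
    pixel_feats: (HW, C) flattened pixel features
    fg_mask:     (HW,) boolean mask, True for foreground pixels

    Assumption: the first N // 2 queries attend only to foreground pixels
    and the remaining queries attend only to background pixels.
    """
    n, c = queries.shape
    half = n // 2

    # Scaled dot-product similarity between every query and every pixel.
    logits = queries @ pixel_feats.T / np.sqrt(c)  # (N, HW)

    # Build the per-query set of pixels each query is allowed to read.
    allowed = np.empty((n, fg_mask.shape[0]), dtype=bool)
    allowed[:half] = fg_mask
    allowed[half:] = ~fg_mask

    # If a query has no allowed pixel (e.g. an empty mask), let it attend
    # everywhere so the softmax stays well defined.
    allowed[~allowed.any(axis=1)] = True

    attn = softmax(np.where(allowed, logits, -np.inf), axis=1)

    # Residual update: each query is refined by the pixels it attended to.
    return queries + attn @ pixel_feats, attn
```

In this sketch, repeating the step several times, with the mask re-estimated between steps, would loosely mirror the iterative interaction between object queries and pixel features that the abstract describes.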

Authors

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing

Topics

Deep Learning > Architectures > Transformers; Computer Vision > Analysis > Object Tracking; Computer Vision > Processing > Video Understanding; Artificial Intelligence > Core AI > Computer Vision; Computer Vision > Processing > Video Segmentation; Computer Vision > Analysis > Object Segmentation; Deep Learning > Models > Transformers; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

semantic segmentation; video segmentation; video understanding; object tracking; memory mechanism; foreground-background separation; video object segmentation; query-based learning; query-based transformer; object query; object-level memory; memory reading; query-based object transformer
