Robust and Consistent Online Video Instance Segmentation via Instance Mask Propagation

Miran Heo; Seoung Wug Oh; Seon Joo Kim; Joon-Young Lee

AAAI 2025

Abstract

Recent online Video Instance Segmentation (VIS) methods show notable performance improvements across benchmarks. However, the leading methods in the tracking-by-detection paradigm often produce temporally inconsistent predictions at both the instance level and the pixel level, which leads to visually unsatisfactory results. To address these challenges, we propose RoCoVIS, a simple yet effective approach that integrates segmentation and tracking to provide consistent online VIS. Our approach uses end-to-end sequential learning in which object queries are propagated through mask predictions, improving the accuracy of temporal instance mapping at the pixel level. Additionally, we propose a new label assignment criterion in harmony with our approach. We also examine the limitations and challenges of the current standard evaluation protocol (AP) and suggest adopting two additional metrics, Tube-Boundary AP and AP_Pool. RoCoVIS demonstrates superior performance on challenging VIS benchmarks with a Swin-L backbone and shows competitive results with a ResNet-50 backbone. When measured with Tube-Boundary AP and AP_Pool, which capture mask accuracy and consistency, RoCoVIS outperforms its counterpart, GenVIS, on HQ-YTVIS and VIPSeg.
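For context on the standard AP protocol mentioned in the abstract: VIS benchmarks in the YouTube-VIS style match a predicted instance to a ground-truth instance with a spatio-temporal (tube) IoU, summing intersections and unions over all frames of the video. The sketch below is only an illustration of that matching quantity, not code from the paper. The names `Tube` and `tube_iou` are hypothetical, and masks are represented as sets of pixel coordinates for brevity. It does not implement Tube-Boundary AP or AP_Pool, whose definitions are not given on this page.

```python
from typing import List, Set, Tuple

# Hypothetical representation (an assumption for this sketch):
# a mask is the set of (row, col) pixels it covers, and a tube holds
# one mask per frame. An empty set means the instance is absent.
Pixel = Tuple[int, int]
Tube = List[Set[Pixel]]


def tube_iou(pred: Tube, gt: Tube) -> float:
    """Spatio-temporal IoU: intersections and unions are summed over frames."""
    if len(pred) != len(gt):
        raise ValueError("both tubes must cover the same number of frames")
    inter = sum(len(p & g) for p, g in zip(pred, gt))
    union = sum(len(p | g) for p, g in zip(pred, gt))
    # Two tubes that are empty in every frame have no overlap to measure.
    return inter / union if union else 0.0


if __name__ == "__main__":
    gt = [{(0, 0), (0, 1)}, {(0, 0), (0, 1)}]
    # The prediction loses the instance in the second frame, which is
    # the kind of temporal inconsistency the abstract describes.
    dropped = [{(0, 0), (0, 1)}, set()]
    print(tube_iou(gt, gt))       # consistent track
    print(tube_iou(dropped, gt))  # track lost halfway through
```

Because the sums run over the whole video, losing an instance in some frames lowers the score for the entire tube rather than for a single frame.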

Topics

Computer Vision > Analysis > Object Detection
Computer Vision > Analysis > Object Tracking
Computer Vision > Processing > Video Understanding
Computer Vision > Analysis > Video Understanding
Artificial Intelligence > Core AI > Computer Vision
Computer Vision > Analysis > Object Segmentation

Keywords

video understanding; video instance segmentation; object tracking; temporal consistency; query-based detection; instance mask propagation; mask propagation
