SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

Brendan Duke; Abdalla Ahmed; Christian Wolf; Parham Aarabi; Graham W. Taylor

CVPR 2021

Abstract

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art. Code is available at https://github.com/dukebw/SSTVOS.
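To make the idea of sparse attention over spatiotemporal features concrete, the following is a minimal sketch and not the authors' implementation. It assumes a local-window sparsity pattern (the abstract does not specify the pattern used by SST), uses raw features as both keys and values with no learned projections, and the function name `sparse_attention` and the parameter `radius` are hypothetical. Each pixel of the current frame attends only to a small spatial neighbourhood in every past frame, rather than to every pixel of every frame.

```python
import math


def sparse_attention(query_feats, memory_feats, radius=1):
    """Local-window attention from one frame over a history of frames.

    query_feats:  H x W x C nested lists (features of the current frame).
    memory_feats: T x H x W x C nested lists (features of past frames).
    Returns an H x W x C nested list. Illustrative assumption: keys double
    as values, and sparsity is a (2*radius+1)^2 window per memory frame.
    """
    height, width = len(query_feats), len(query_feats[0])
    channels = len(query_feats[0][0])
    scale = 1.0 / math.sqrt(channels)  # scaled dot-product attention
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            q = query_feats[y][x]
            # Gather keys only from the window around (y, x) in every frame.
            keys = [
                frame[ky][kx]
                for frame in memory_feats
                for ky in range(max(0, y - radius), min(height, y + radius + 1))
                for kx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            # Output is the softmax-weighted average of the gathered features.
            row.append([
                sum(w * k[c] for w, k in zip(weights, keys)) / total
                for c in range(channels)
            ])
        output.append(row)
    return output
```

Under these assumptions, each query compares against at most T × (2·radius + 1)² keys instead of T × H × W, which is where the scalability benefit of a sparse pattern comes from. Because the scores are dot products between the query and features at nearby locations in earlier frames, the attention weights behave like a soft correspondence map.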

Topics

Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Video Understanding
Computer Vision > Analysis > Object Segmentation

Keywords

visual object tracking, sparse attention, video object segmentation, spatiotemporal transformer, correspondence computation
