PolyFormer: Referring Image Segmentation As Sequential Polygon Generation

Jiang Liu; Hui Ding; Zhaowei Cai; Yuting Zhang; Ravi Kumar Satzoda; Vijay Mahadevan; R. Manmatha

CVPR 2023

Abstract

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
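Two ideas in the abstract can be illustrated with a small sketch: converting a predicted polygon into a segmentation mask, and the error introduced when a coordinate is quantized into bins rather than regressed as a floating-point value. The code below is only an illustration under stated assumptions, not the PolyFormer implementation. The function names `polygon_to_mask` and `quantize`, the even-odd pixel-center rasterization rule, and the bin count are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel units


def polygon_to_mask(vertices: List[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize one polygon with the even-odd rule, testing pixel centers.

    Hypothetical helper: the abstract only says polygons are converted to
    masks; this is one simple way to do it, not the authors' method.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5  # y coordinate of the pixel center
        for col in range(width):
            px = col + 0.5  # x coordinate of the pixel center
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does this edge cross the horizontal line through the center?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


def quantize(value: float, num_bins: int, extent: float) -> float:
    """Snap a coordinate to the center of one of num_bins equal bins over [0, extent).

    Assumed binning scheme, used only to show where quantization error comes from.
    """
    bin_width = extent / num_bins
    index = min(int(value / bin_width), num_bins - 1)
    return (index + 0.5) * bin_width


# A floating-point square covering the four central pixels of a 4x4 image.
square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
mask = polygon_to_mask(square, height=4, width=4)

# With 64 bins over a 640-pixel axis, each bin is 10 pixels wide, so a
# quantized coordinate can be off by up to 5 pixels from the true value.
true_x = 12.3
quantization_error = abs(quantize(true_x, num_bins=64, extent=640.0) - true_x)
```

A decoder that regresses `true_x` directly has no such binning step, which is the motivation the abstract gives for the regression-based decoder.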

Topics

Deep Learning > Architectures > Transformers
Computer Vision > Processing > Image Segmentation
Computer Vision > Processing > Semantic Segmentation
Artificial Intelligence > Core AI > Computer Vision

Keywords

transformer architecture, semantic segmentation, sequential prediction, autoregressive model, referring image segmentation, polygon generation
