CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Shuyang Sun; Runjia Li; Philip Torr; Xiuye Gu; Siyang Li

2024 CVPR CVPR 2024

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive which limits the number of categories in segmentation datasets. Consequently the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However without fine-tuning VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts but also those fine-tuned with millions of data samples and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely we improve the current record by 28.8 16.0 and 6.9 mIoU on Pascal VOC COCO Object and Pascal Context.

🌉 Interdisciplinary Bridge — Computer Vision and Machine Learning

🧭 Keyword Pioneer — image-text supervision

🐣 Hot Topic Early Bird — open-vocabulary segmentation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shuyang Sun , Runjia Li , Philip Torr , Xiuye Gu , Siyang Li

Topics

Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Analysis > Semantic Segmentation

Keywords

zero-shot learning semantic segmentation open-vocabulary segmentation visual language model referring segmentation image-text supervision

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024