SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Bin Xie; Jiale Cao; Jin Xie; Fahad Shahbaz Khan; Yanwei Pang

CVPR 2024

Abstract

Open-vocabulary semantic segmentation aims to group pixels into semantic classes drawn from an open set of categories. Most existing methods build on pre-trained vision-language models, where the key challenge is adapting an image-level model to the pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation. It comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs a hierarchical backbone, instead of a plain transformer, to predict the pixel-level image-text cost map. Compared to a plain transformer, a hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine the cost map with the feature maps of different backbone levels for segmentation. To accelerate inference, we introduce a category early rejection scheme in the decoder that rejects many non-existing categories at the early layers of the decoder, resulting in up to 4.7 times acceleration without accuracy degradation. Experiments on multiple open-vocabulary semantic segmentation datasets demonstrate the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves an mIoU score of 31.6% on ADE20K with 150 categories at 82 milliseconds (ms) per image on a single A6000. Our source code is available at https://github.com/xb534/SED.
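
The two ideas named in the abstract, a pixel-level image-text cost map and category early rejection, can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that). It assumes the cost map is the cosine similarity between each pixel embedding and each category text embedding, and it assumes a rejection rule that keeps the union of every pixel's top-k categories; the abstract does not specify either detail. All function names (`cost_map`, `early_reject`, `segment`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def cost_map(pixel_feats, text_feats):
    """Return a P x N matrix: similarity of each pixel to each category.

    Assumption: pixel_feats stands in for per-pixel image embeddings and
    text_feats for per-category text embeddings in a shared space.
    """
    return [[cosine(p, t) for t in text_feats] for p in pixel_feats]


def early_reject(cost, k):
    """Keep the union over all pixels of each pixel's top-k categories.

    Assumption: this top-k union rule is an illustrative choice; categories
    outside the returned list are dropped from later decoder computation.
    """
    keep = set()
    for row in cost:
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        keep.update(ranked[:k])
    return sorted(keep)


def segment(cost, kept):
    """Assign each pixel the best-scoring category among the kept ones."""
    return [max(kept, key=lambda c: row[c]) for row in cost]


if __name__ == "__main__":
    # Three toy "pixels" and four toy "categories" in a 3-D embedding space.
    pixels = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
    texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]

    cost = cost_map(pixels, texts)
    kept = early_reject(cost, k=1)
    print(kept)                 # [0, 1]
    print(segment(cost, kept))  # [0, 0, 1]
```

In this toy case, two of the four categories are rejected before the final per-pixel assignment, so any later per-category work would run on half as many categories. The labels are unchanged because the rejected categories were never the best match for any pixel.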

Authors

Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

Topics

Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Architectures > Transformers
Computer Vision > Processing > Image Segmentation
Computer Vision > Processing > Semantic Segmentation
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Transfer Learning

Keywords

semantic segmentation, vision-language model, open-vocabulary learning, cost map, hierarchical backbone
