2025 AAAI AAAI 2025

Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation

Abstract

Abstract Deep learning models often suffer from performance degradation in unseen domains, posing a risk for safety-critical applications such as autonomous driving. To tackle this problem, recent studies have leveraged pre-trained Visual Foundation Models (VFMs) to enhance generalization. However, exsiting works mainly focus on designing intricate networks for VFMs, neglecting their inherent strong generalization potential. Moreover, these methods typically perform inference on low-resolution images. The loss of detail hinders accurate predictions in unseen domains, especially for small objects. In this paper, we argue that simply fine-tuning VFMs and leveraging high-resolution images unleash the power of VFMs for generalizable semantic segmentation. Therefore, we design a VFM-based segmentation network (VFMNet) that adapts VFMs to this task with minimal fine-tuning, preserving their generalizable knowledge. Then, to fully utilize high-resolution images, we train a Mask-guided Refinement Network (MGRNet) to refine VFMNet's predictions combining detailed image features. Furthermore, we adopt a two-stage coarse-to-fine inference approach. MGRNet is used to refine the low-confidence regions predicted by VFMNet to obtain fine-grained results. Extensive experiments demonstrate the effectiveness of our method, outperforming state-of-the-art methods by 3.3% on the average mIoU in synthetic-to-real domain generalization.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio