Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP
Abstract
Few-shot Anomaly Detection (FSAD) is a classic computer vision task, and recent FSAD methods achieve remarkable performance by building on the pre-trained vision-language model CLIP. However, existing CLIP-based approaches disregard object semantics, which are crucial for FSAD because they guide comparisons between semantically corresponding patches. To address this limitation, we propose Sea-CLIP, a novel method that integrates semantic-aware representations from DINOv2 to enhance FSAD representation learning. Specifically, Sea-CLIP first applies a Patch Matching module that uses semantic-aware representations to obtain coarse anomaly segmentation masks. These masks guide a lightweight Anomaly Matching Decoder (AMD) that jointly uses CLIP and DINOv2 features and formulates FSAD as a feature-matching task. Unlike prior patch-matching works that compute anomaly scores directly, our method uses the AMD to refine the coarse predictions into a precise anomaly mask. Sea-CLIP achieves state-of-the-art FSAD performance on the MVTec and VisA datasets, and we provide a detailed analysis of how semantic-aware representations contribute to identifying anomaly patterns.
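
As an illustration of the coarse patch-matching idea, the following is a minimal sketch and not the Sea-CLIP implementation. It assumes that patch features (for example, from DINOv2) have already been extracted as arrays, and it scores each query patch by its cosine distance to the nearest patch from the few normal reference images. The function name `coarse_anomaly_map`, the nearest-neighbor rule, and the cosine distance are assumptions made for this example.

```python
import numpy as np


def coarse_anomaly_map(query_feats, ref_feats, grid_hw):
    """Hypothetical helper: score query patches against normal reference patches.

    query_feats: (N, D) patch features of the test image.
    ref_feats:   (M, D) patch features pooled from the few normal images.
    grid_hw:     (H, W) patch grid with H * W == N.
    Returns an (H, W) map; higher values mean "less like any normal patch".
    """
    eps = 1e-12  # guards against division by zero for all-zero features
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    r = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)

    # Cosine similarity between every query patch and every reference patch.
    sim = q @ r.T  # shape (N, M)

    # Assumed scoring rule: distance to the most similar normal patch.
    score = 1.0 - sim.max(axis=1)
    return score.reshape(grid_hw)
```

Thresholding this map would yield a coarse mask. In the pipeline described above, such a coarse prediction is refined by the AMD rather than used directly as the final anomaly score.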