Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Sravanti Addepalli; Ashish Ramayee Asokan; Lakshay Sharma; R. Venkatesh Babu

2024 CVPR CVPR 2024

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Abstract

Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs resulting in remarkable generalization across several data distributions. However in several cases their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this we propose Vision-Language to Vision - Align Distill Predict (VL2V-ADiP) which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sravanti Addepalli , Ashish Ramayee Asokan , Lakshay Sharma , R. Venkatesh Babu

Topics

Machine Learning > Application Areas > Domain Generalization Machine Learning > Application Areas > Knowledge Distillation Computer Vision > Core AI > Multimodal Learning Machine Learning > Learning Types > Domain Generalization Deep Learning > Learning Types > Knowledge Distillation Deep Learning > Learning Types > Domain Adaptation

Keywords

image classification domain generalization knowledge distillation out-of-distribution generalization vision-language model black-box setting student model

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024