Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Zhiyu Zhao; Bingkun Huang; Sen Xing; Gangshan Wu; Yu Qiao; Limin Wang

CVPR 2024

Abstract

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, large foundation models often incur high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy in which the teacher model sees more context information through a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on ImageNet-1K (IN1K) using the ViT-B model, and 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
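The asymmetric masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `asymmetric_masks`, the 196-patch grid, and the 0.75 and 0.50 masking ratios are hypothetical choices made for illustration, and the choice to make the student's visible patches a subset of the teacher's is an assumption rather than something the abstract states.

```python
import random


def asymmetric_masks(num_patches, student_mask_ratio, teacher_mask_ratio, seed=0):
    """Return (student_visible, teacher_visible) as sorted patch-index lists.

    Illustrative sketch only. Assumption: the student's visible patches are
    a subset of the teacher's, so the teacher sees everything the student
    sees plus extra context.
    """
    if not 0.0 <= teacher_mask_ratio <= student_mask_ratio < 1.0:
        raise ValueError("expected 0 <= teacher_mask_ratio <= student_mask_ratio < 1")

    # One shared random permutation of patch indices drives both masks.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    # A lower masking ratio leaves more patches visible.
    n_student = num_patches - int(num_patches * student_mask_ratio)
    n_teacher = num_patches - int(num_patches * teacher_mask_ratio)

    # Taking prefixes of the same permutation guarantees the subset relation.
    student_visible = sorted(order[:n_student])
    teacher_visible = sorted(order[:n_teacher])
    return student_visible, teacher_visible


# Hypothetical setting: a 14 x 14 patch grid with illustrative ratios.
student_visible, teacher_visible = asymmetric_masks(196, 0.75, 0.50)
print(len(student_visible), len(teacher_visible))
```

With these illustrative ratios the student encoder would receive 49 of 196 patches and the teacher 98. Sharing one permutation keeps the two views aligned, so teacher features exist at every position the student encodes, which is one plausible way to set up the feature alignment the abstract mentions.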

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Machine Learning > Application Areas > Model Compression
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Models > Transformers
Deep Learning > Techniques > Knowledge Distillation
Deep Learning > Learning Types > Transfer Learning

Keywords

model compression, feature alignment, representation learning, vision transformer, self-supervised learning, knowledge distillation, foundation model, masked autoencoder, masked autoencoding
