Delving Deep Into the Generalization of Vision Transformers Under Distribution Shifts

Chongzhi Zhang; Mingyuan Zhang; Shanghang Zhang; Daisheng Jin; Qiang Zhou; Zhongang Cai; Haiyu Zhao; Xianglong Liu; Ziwei Liu

2022 CVPR CVPR 2022

Delving Deep Into the Generalization of Vision Transformers Under Distribution Shifts

Abstract

Recently, Vision Transformers have achieved impressive results on various Vision tasks. Yet, their generalization ability under different distribution shifts is poorly understood. In this work, we provide a comprehensive study on the out-of-distribution generalization of Vision Transformers. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual levels: corruption shift, background shift, texture shift, destruction shift, and style shift. Then we perform extensive evaluations of Vision Transformer variants under different levels of distribution shifts and compare their generalization ability with Convolutional Neural Network (CNN) models. Several important observations are obtained: 1) Vision Transformers generalize better than CNNs under multiple distribution shifts. With the same or less amount of parameters, Vision Transformers are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of distribution shift. In particular, Vision Transformers lead by more than 10% under the corruption shifts. 2) larger Vision Transformers gradually narrow the in-distribution (ID) and out-of-distribution (OOD) performance gap. To further improve the generalization of Vision Transformers, we design the enhanced Vision Transformers through self-supervised learning, information theory, and adversarial learning. By investigating these three types of generalization-enhanced Transformers, we observe the gradient-sensitivity of Vision Transformers and design a smoother learning strategy to achieve a stable training process. With modified training schemes, we achieve improvements on performance towards out-of-distribution data by 4% from vanilla Vision Transformers. We comprehensively compare these three types of generalization-enhanced Vision Transformers with their corresponding CNN models and observe that: 1) For the enhanced model, larger Vision Transformers still benefit more from the

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Chongzhi Zhang , Mingyuan Zhang , Shanghang Zhang , Daisheng Jin , Qiang Zhou , Zhongang Cai , Haiyu Zhao , Xianglong Liu , Ziwei Liu

Topics

Machine Learning > Learning Types > Adversarial Learning Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Application Areas > Domain Generalization

Keywords

adversarial learning vision transformer self-supervised learning out-of-distribution generalization distribution shift

Download PDF

Related papers

UniCoRN: A Unified Conditional Image Repainting Network 2022

Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis 2022

All-in-One Image Restoration for Unknown Corruption 2022

Stability-Driven Contact Reconstruction From Monocular Color Images 2022

Forecasting Characteristic 3D Poses of Human Actions 2022