Do Pre-Trained Models Benefit Equally in Continual Learning?
Abstract
A large part of the continual learning (CL) literature focuses on developing algorithms for models trained from scratch. While these algorithms work great with from-sc ratch trained models on widely used CL benchmarks, they show dramatic performance drops on more complex datasets (e.g., Split-CUB200). Pre-trained models, widely used to transfer knowledge to downstream tasks, could enhance these methods to be applicable in more realistic scenarios. However, surprisingly, improvements in CL algorithms from pre-training are inconsistent. For instance, while Incremental Classifier and Representation Learning (iCaRL) underperforms Supervised Contrastive Replay (SCR) when trained from scratch, it outperforms SCR when both are initialized with a pre-trained model. This indicates the paradigm current CL literature follows, where all methods are compared in from-scratch training, is not well reflective of the true CL objective and desired progress. Furthermore, we found 1) CL algorithms that exert less regularization benefit more from a pre-trained model; 2) a model pre-trained with a larger dataset (WebImageText in Contrastive Language-Image Pre-training (CLIP) vs. ImageNet) does not guarantee a better improvement. Based on these findings, we introduced a simple yet effective baseline that employs minimum regularization and leverages the more beneficial pre-trained model, which outperforms state-of-the-art methods when pre-training is applied. Our code is available at https://github.com/eric11220/pretrained-models-in-CL.