On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Abstract
AbstractA small subset of dimensions within language Transformers’ representation spaces emerge as “outliers” during pretraining, encoding critical knowledge sparsely. We extend previous findings on emergent outliers to Encoder-Decoder Transformers and instruction-finetuned models, and tackle the problem of distilling a student Transformer from a larger teacher Transformer. Knowledge distillation reduces model size and cost by transferring knowledge from a larger teacher to a smaller student, necessitating a trade-off among representation dimensions. We show that emergent outlier dimensions contribute significantly more to zero-shot performance than non-outlier dimensions. Based on this, we propose the Emergent Outlier Focused Distillation (EOFD) method, which prioritizes critical outlier dimensions in distillation using a weighted MSE loss. We empirically demonstrate that EOFD outperforms state-of-the-art distillation methods and generalizes well across Encoder-only BERT, Decoder-only GPT-2, and Encoder-Decoder T5 architectures.