Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

Shaoyi Huang; Dongkuan Xu; Ian Yen; Yijue Wang; Sung-En Chang; Bingbing Li; Shiyang Chen; Mimi Xie; Sanguthevar Rajasekaran; Hang Liu; Caiwen Ding

ACL 2022

Abstract

Conventional wisdom in pruning Transformer-based language models is that pruning reduces model expressiveness, so a pruned model is more likely to underfit than to overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.

Authors

Shaoyi Huang, Dongkuan Xu, Ian Yen, Yijue Wang, Sung-En Chang, Bingbing Li, Shiyang Chen, Mimi Xie, Sanguthevar Rajasekaran, Hang Liu, Caiwen Ding

Topics

Artificial Intelligence > Core AI > Model Compression
Machine Learning > Application Areas > Knowledge Distillation
Machine Learning > Application Areas > Model Compression
Deep Learning > Optimization & Theory > Model Compression
Deep Learning > Techniques > Knowledge Distillation

Keywords

knowledge distillation; model pruning; overfitting reduction; transformer-based language model; progressive knowledge distillation; pretrain-and-finetune paradigm

Related papers

KG-CRuSE: Recurrent Walks over Knowledge Graph for Explainable Conversation Reasoning using Semantic Embeddings 2022

Toward Knowledge-Enriched Conversational Recommendation Systems 2022

Investigating the Medical Coverage of a Translation System into Pictographs for Patients with an Intellectual Disability 2022

TableFormer: Robust Transformer Modeling for Table-Text Encoding 2022
