Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

Mengjie Zhao; Tao Lin; Fei Mi; Martin Jaggi; Hinrich Schütze

EMNLP 2020

Abstract

We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred. Intrinsic evaluations show that representations computed by our binary masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.
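The core idea in the abstract, keeping pretrained weights frozen and selecting a subset of them with a per-task binary mask, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the thresholding rule used to binarize the scores, the toy weight values, and the storage model (32-bit floats for weights, 1 bit per mask entry, every weight masked). How the mask scores are actually trained is not described in the abstract and is omitted here.

```python
def binarize(scores, threshold=0.0):
    """Turn real-valued scores into a 0/1 mask.

    Assumption: a simple threshold rule; the abstract does not say how
    the binary masks are obtained.
    """
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def masked_linear(x, weights, mask):
    """Compute y = (W * M) x, where W stays frozen and M is a binary mask."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def per_task_storage_bytes(num_params, num_tasks, bits_per_entry):
    """Bytes needed to keep one tensor of num_params entries for each task."""
    return num_params * num_tasks * bits_per_entry // 8


# Toy "pretrained" weights (hypothetical values); they are never modified.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.25, -0.5]]

# Hypothetical task-specific scores; only these would be learned per task.
scores = [[0.3, -0.2, 0.1],
          [-0.4, 0.6, 0.0]]

mask = binarize(scores)
y = masked_linear([1.0, 2.0, 3.0], weights, mask)

# Per-task storage for eleven tasks and one million maskable parameters:
# a full finetuned float32 copy per task versus a 1-bit mask per task.
finetune_bytes = per_task_storage_bytes(1_000_000, 11, 32)
mask_bytes = per_task_storage_bytes(1_000_000, 11, 1)
```

Under these assumptions the per-task cost of a mask is 32 times smaller than a finetuned copy, since each entry needs 1 bit instead of 32; the single shared copy of the pretrained weights is stored once in either case. The real savings depend on which layers are masked and on the storage format, which this sketch does not model.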

Topics

Machine Learning > Application Areas > Efficient Computing
Machine Learning > Application Areas > Model Merging
Natural Language Processing > Resources & Methods > Large Language Models
Deep Learning > Optimization & Theory > Model Compression
Deep Learning > Techniques > Fine-Tuning

Keywords

model compression, BERT model, parameter efficient, downstream task, parameter-efficient tuning, memory footprint, pretrained language model, binary mask, loss landscape analysis, binary masking
