BiT: Robustly Binarized Multi-distilled Transformer

Zechun Liu; Barlas Oguz; Aasish Pappu; Lin Xiao; Scott Yih; Meng Li; Raghuraman Krishnamoorthi; Yashar Mehdad

2022 NIPS NeurIPS 2022

BiT: Robustly Binarized Multi-distilled Transformer

Abstract

Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, however, is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enables binary transformers at a much higher accuracy than what was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%. Code and models are available at:https://github.com/facebookresearch/bit.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🐣 Hot Topic Early Bird — weight quantization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zechun Liu , Barlas Oguz , Aasish Pappu , Lin Xiao , Scott Yih , Meng Li , Raghuraman Krishnamoorthi , Yashar Mehdad

Topics

Machine Learning > Application Areas > Efficient Computing Machine Learning > Application Areas > Knowledge Distillation Deep Learning > Architectures > Transformers Machine Learning > Application Areas > Model Compression Deep Learning > Optimization & Theory > Model Compression Deep Learning > Techniques > Knowledge Distillation

Keywords

neural network compression model quantization knowledge distillation weight quantization model binarization activation quantization

Download PDF

Related papers

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching 2022

A Theoretical View on Sparsely Activated Networks 2022

Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks 2022

Matryoshka Representation Learning 2022

Off-Policy Evaluation with Deficient Support Using Side Information 2022