BabyLM Challenge: Experimenting with Self-Distillation and Reverse-Distillation for Language Model Pre-Training on Constrained Datasets

Aakarsh Nair; Alina Hancharova; Mayank Kumar; Ali Gharaee

CoNLL 2024

Abstract

Language models (LMs) are far less data-efficient than human learners. A child masters language from fewer than 100 million words of input, whereas language models require orders of magnitude more tokens during training. Our submission to the BabyLM Challenge combines self-distillation and reverse-distillation to train a sequence of ensemble models with improved training characteristics on a fixed-size 10-million-word dataset. Self-distillation is used to generate an ensemble of models of a given fixed size, while reverse-distillation is used to train a larger, more expressive model from a previously trained generation of smaller models while largely preserving learned accuracy. We find that ensembles consisting of two smaller models and one identical born-again model serve as ideal ensembles for each trained generation of model size. We demonstrate that, although our method is not novel, it provides consistent and modest performance improvements on the BLiMP and GLUE benchmarks.
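
The abstract does not spell out the training objective, so the following is only a minimal sketch of how an ensemble of teachers could supervise a student, assuming the standard temperature-scaled distillation loss (cross-entropy on the gold token plus a KL term against the averaged teacher distribution). The function names and the `alpha` and `temperature` parameters are hypothetical and are not taken from the paper. Under this assumption, self-distillation and reverse-distillation would share the same loss and differ only in whether the student matches the teachers in size or is larger than them.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature=1.0):
    """Average the softened distributions of several teachers (assumed scheme)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    vocab_size = len(probs[0])
    return [sum(p[i] for p in probs) / len(probs) for i in range(vocab_size)]


def distillation_loss(student_logits, teacher_logits, gold_index,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(gold) + (1 - alpha) * T^2 * KL(teachers || student).

    alpha and temperature are illustrative defaults, not values from the paper.
    """
    targets = ensemble_soft_targets(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(targets, student_soft) if t > 0)
    cross_entropy = -math.log(softmax(student_logits)[gold_index])
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Toy next-token example over a 3-word vocabulary with two teachers.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
student = [0.2, 0.1, 0.0]
loss = distillation_loss(student, teachers, gold_index=0)
```

In this sketch, a student whose logits already match a single teacher incurs no KL penalty, so only the cross-entropy term remains.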

Authors

Aakarsh Nair, Alina Hancharova, Mayank Kumar, Ali Gharaee

Topics

Machine Learning > Application Areas > Knowledge Distillation
Natural Language Processing > Generation > Language Modeling

Keywords

model compression, ensemble learning, knowledge distillation, language modeling
