INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models

H S V N S Kowndinya Renduchintala; Krishnateja Killamsetty; Sumit Bhatia; Milan Aggarwal; Ganesh Ramakrishnan; Rishabh Iyer; Balaji Krishnamurthy

EMNLP 2023


Abstract

A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and the emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state of the art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question that we ask is whether it is possible to train PTLMs by employing only highly informative subsets of the training data while maintaining downstream performance. Building upon the recent progress in informative data subset selection, we show how we can employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to ~99% of the performance of the fully-trained models. We made our framework publicly available at https://github.com/Efficient-AI/ingenious.
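To make the idea of submodular subset selection concrete, here is a minimal sketch of greedy maximization of a facility-location objective over sample feature vectors. The abstract does not say which submodular function or feature representation the framework uses, so the facility-location choice, the cosine similarity, and the names `cosine` and `greedy_facility_location` are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_facility_location(features, budget):
    """Greedily pick `budget` indices that approximately maximize
    f(S) = sum_i max_{j in S} sim(i, j), a facility-location objective.

    Hypothetical helper for illustration only. It builds a dense n x n
    similarity matrix, so it is only suitable for small inputs.
    """
    n = len(features)
    budget = min(budget, n)
    sim = [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]

    coverage = [0.0] * n      # best similarity of each sample to the selected set
    selected = []
    remaining = set(range(n))

    for _ in range(budget):
        best_j, best_gain = None, -1.0
        for j in sorted(remaining):
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        remaining.remove(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]

    return selected


if __name__ == "__main__":
    # Two tight clusters of toy "sentence embeddings"; a budget of 2
    # should pick one representative from each cluster.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(greedy_facility_location(toy, budget=2))
```

The greedy rule adds, at each step, the sample whose inclusion most increases how well the selected set covers the whole corpus, which is why the result tends to be representative rather than redundant. A corpus-scale version would need partitioning or approximate similarity search, since the dense similarity matrix above grows quadratically with the number of samples.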



Topics

Machine Learning > Optimization & Theory > Optimization

Natural Language Processing > Resources & Methods > Large Language Models

Mathematics & Optimization > Optimization > Combinatorial Optimization

Machine Learning > Learning Paradigms > Transfer Learning

Machine Learning > Learning Types > Transfer Learning

Natural Language Processing > Resources & Methods > Language Modeling

Artificial Intelligence > Core AI > Efficient Computing

Deep Learning > Optimization & Theory > Optimization

Keywords

submodular optimization; model compression; data subset selection; efficient computing; computational efficiency; representative subset selection; language model; data selection; data efficiency; pre-trained language model; representative subset


Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023