Frequency Balanced Datasets Lead to Better Language Models

Rodolfo Zevallos; Mireia Farrús; Núria Bel

EMNLP 2023

Abstract

This paper reports on experiments aimed at improving our understanding of how much data is required to train attention-based transformer language models. Specifically, we investigate whether the immense amounts of pre-training data can be reduced through sampling strategies that identify and reduce high-frequency tokens, since several studies have indicated that very high-frequency tokens in pre-training data might bias learning and cause undesired effects. In this light, we describe a sampling algorithm that iteratively assesses token frequencies and removes sentences that still contain high-frequency tokens, eventually delivering a balanced, linguistically correct dataset. We evaluate the results in terms of model perplexity and by fine-tuning on linguistic probing tasks, NLP downstream tasks, and the more semantic SuperGLUE tasks. The results show that pre-training with the resulting balanced dataset allows the amount of pre-training data to be reduced by up to a factor of three.
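The abstract gives only a high-level description of the sampling algorithm, so the following is a minimal illustrative sketch of the general idea and not the authors' implementation. It assumes whitespace tokenization, a hypothetical `max_share` threshold on the most frequent token's share of all tokens, and a hypothetical rule of dropping one sentence per round (the one with the most occurrences of that token); none of these details come from the paper.

```python
from collections import Counter


def balance_by_token_frequency(sentences, max_share=0.2, max_rounds=1000):
    """Iteratively drop whole sentences until no token dominates the corpus.

    `max_share` and `max_rounds` are hypothetical parameters chosen for
    illustration; the paper's actual stopping criterion may differ.
    """
    # Whitespace tokenization is an assumption made for this sketch.
    kept = [s.split() for s in sentences]
    for _ in range(max_rounds):
        # Re-assess token frequencies on the sentences that remain.
        counts = Counter(tok for sent in kept for tok in sent)
        total = sum(counts.values())
        if total == 0:
            break
        token, freq = counts.most_common(1)[0]
        if freq / total <= max_share:
            break  # the most frequent token is within the threshold
        # Drop the sentence with the most occurrences of the dominant token.
        worst = max(range(len(kept)), key=lambda i: kept[i].count(token))
        del kept[worst]
    return [" ".join(sent) for sent in kept]


corpus = [
    "the the the the cat",
    "the dog barked loudly",
    "a bird sang",
    "fish swim in rivers",
]
balanced = balance_by_token_frequency(corpus, max_share=0.2)
print(balanced)
```

Because only whole sentences are removed, the surviving text is never cut mid-sentence, which is in the spirit of the abstract's "linguistically correct" requirement.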

Topics

Machine Learning > Application Areas > Data Augmentation
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Natural Language Processing > Generation > Language Modeling
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Core AI > Language

Keywords

transformer architecture, sampling strategy, self-supervised learning, language model, data sampling, token frequency, pre-training data
