Data Similarity is Not Enough to Explain Language Model Performance

Gregory Yauney; Emily Reif; David Mimno

2023 EMNLP EMNLP 2023

Data Similarity is Not Enough to Explain Language Model Performance

Abstract

AbstractLarge language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model’s pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — performance correlation

🐣 Hot Topic Early Bird — distribution alignment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Gregory Yauney , Emily Reif , David Mimno

Topics

Machine Learning > Optimization & Theory > Learning Theory Natural Language Processing > Resources & Methods > Large Language Models

Keywords

language model downstream task distribution alignment pretraining datum performance correlation data similarity

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023