Stable Language Model Pre-training by Reducing Embedding Variability

Woojin Chung; Jiwoo Hong; Na Min An; James Thorne; Se-Young Yun

EMNLP 2024

Abstract

Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability is impractical due to high computational costs. We study Token Embedding Variability as a simple proxy to estimate pre-training stability. We theoretically and empirically demonstrate that Multi-head Low-Rank Attention acts as a fundamental approach to reducing instability. This is supported by empirical findings on variants of GPT-2, which show improved stability and lower perplexities, even at deeper layer counts.
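
As a rough illustration of the kind of statistic an embedding-variability proxy could track, the sketch below computes a per-token standard deviation across embedding dimensions and summarizes it over the vocabulary. The abstract does not give the exact definition of Token Embedding Variability, so the function name `embedding_variability` and this formulation are assumptions made for illustration, not the authors' definition.

```python
import statistics


def embedding_variability(embeddings):
    """Return (mean, std) over tokens of the per-token standard deviation.

    `embeddings` is a list of token vectors (one list of floats per token).
    Assumption: variability is measured per token, across its dimensions.
    """
    # Population standard deviation of each token's embedding values.
    per_token_std = [statistics.pstdev(vec) for vec in embeddings]
    # Summarize the per-token values over the whole vocabulary.
    mean_std = statistics.fmean(per_token_std)
    std_std = statistics.pstdev(per_token_std)
    return mean_std, std_std


# Toy two-token "vocabulary": one constant vector, one alternating vector.
toy = [[0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
mean_std, std_std = embedding_variability(toy)
```

A statistic like this needs only the embedding matrix of a checkpoint, so it is far cheaper to compute than monitoring stability through repeated training runs, which is the cost the abstract points to.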

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Optimization & Theory > Optimization
Deep Learning > Architectures > Transformers
Deep Learning > Models > Large Language Models
Deep Learning > Models > Language Models
Deep Learning > Optimization & Theory > Neural Network Optimization
Deep Learning > Optimization & Theory > Optimization
Natural Language Processing > Generation > Language Modeling

Keywords

attention mechanism, neural network optimization, training stability, language model pre-training, language model pretraining, token embedding, pre-training stability, embedding variability, multi-head low-rank attention, model perplexity
