A Closer Look at Parameter Contributions When Training Neural Language and Translation Models

Raúl Vázquez; Hande Celikkanat; Vinit Ravishankar; Mathias Creutz; Jörg Tiedemann

COLING 2022

Abstract

We analyze the learning dynamics of neural language and translation models using Loss Change Allocation (LCA), an indicator that enables a fine-grained analysis of parameter updates when optimizing the loss function. In other words, we can observe the contributions of different network components at training time. In this article, we systematically study masked language modeling, causal language modeling, and machine translation. We show that the choice of training objective leads to distinctive optimization procedures, even when performed on comparable Transformer architectures. We demonstrate how the various Transformer parameters are used during training, supporting the view that the feed-forward components of each layer are the main contributors to the optimization procedure. Finally, we find that the learning dynamics are not affected by data size and distribution but are instead determined by the learning objective.
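The abstract does not spell out how LCA is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the usual path-integral view of LCA, in which the loss change over one update is split across parameters as the gradient component times that parameter's displacement. The toy least-squares loss, the midpoint gradient evaluation, and all names (`loss`, `grad`, `lca_step`, `run`) are assumptions made for illustration.

```python
import numpy as np


def loss(theta, A, b):
    """Toy quadratic loss 0.5 * ||A @ theta - b||^2 (hypothetical stand-in for a model loss)."""
    r = A @ theta - b
    return 0.5 * float(r @ r)


def grad(theta, A, b):
    """Gradient of the toy loss with respect to theta."""
    return A.T @ (A @ theta - b)


def lca_step(theta_old, theta_new, A, b):
    """Per-parameter allocation of the loss change for one update.

    Assumption: the gradient is evaluated at the midpoint of the update,
    which makes the allocation sum exactly to the loss change for a
    quadratic loss. Negative entries mean the parameter helped reduce the loss.
    """
    mid = 0.5 * (theta_old + theta_new)
    return grad(mid, A, b) * (theta_new - theta_old)


def run(steps=20, lr=0.01, seed=0):
    """Run plain gradient descent and accumulate per-parameter allocations."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(8, 3))
    b = rng.normal(size=8)
    theta = np.zeros(3)
    total = np.zeros(3)
    start = loss(theta, A, b)
    for _ in range(steps):
        new_theta = theta - lr * grad(theta, A, b)
        total += lca_step(theta, new_theta, A, b)
        theta = new_theta
    return total, loss(theta, A, b) - start


if __name__ == "__main__":
    allocation, loss_change = run()
    print("per-parameter allocation:", allocation)
    print("sum of allocations:", allocation.sum())
    print("actual loss change:", loss_change)
```

In this toy setting the allocations sum to the actual change in loss, which is the property that lets contributions be aggregated per parameter, per layer, or per component. For a real network the same bookkeeping would be grouped by parameter tensor, and a more accurate numerical integration scheme may be needed because the loss is not quadratic.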


Authors

Raúl Vázquez, Hande Celikkanat, Vinit Ravishankar, Mathias Creutz, Jörg Tiedemann

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Application Areas > Knowledge Distillation
Natural Language Processing > Generation > Language Modeling
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Generation > Machine Translation
Deep Learning > Optimization & Theory > Neural Network Optimization
Artificial Intelligence > Core AI > Language

Keywords

transformer architecture, optimal transport, knowledge distillation, machine translation, causal language modeling, masked language modeling, semantic distance, loss change allocation, parameter contribution

Related papers

MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation 2022

The Role of Context and Uncertainty in Shallow Discourse Parsing 2022

SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training 2022

Complicate Then Simplify: A Novel Way to Explore Pre-trained Models for Text Classification 2022

Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories 2022