On Losses for Modern Language Models

Stéphane Aroca-Ouellette; Frank Rudzicz

EMNLP 2020

Abstract

BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP’s effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks – sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant – that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT-Base on the GLUE benchmark using fewer than a quarter of the training tokens.
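As a rough illustration of what multi-task pre-training can look like at the loss level, the sketch below adds auxiliary task losses to an MLM loss. The abstract does not say how the losses are combined, so the weighted sum, the function name `combine_losses`, the task names, and the weights are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, Optional


def combine_losses(
    mlm_loss: float,
    aux_losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return the MLM loss plus a weighted sum of auxiliary task losses.

    Hypothetical helper: a plain weighted sum is assumed here; the paper
    may combine tasks differently.
    """
    weights = weights or {}
    total = mlm_loss
    for task, loss in aux_losses.items():
        # Tasks without an explicit weight default to 1.0 (an assumption).
        total += weights.get(task, 1.0) * loss
    return total


# Illustrative numbers only, not results from the paper.
total = combine_losses(
    2.0,
    {"sentence_ordering": 0.5, "tf_idf_prediction": 0.25},
    weights={"tf_idf_prediction": 2.0},
)
print(total)
```

In a real training loop the scalars would be framework tensors, and the combined value would be the quantity that is backpropagated.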

Topics

Machine Learning > Optimization & Theory > Loss Functions

Artificial Intelligence > Core AI > Large Language Models

Deep Learning > Learning Types > Representation Learning

Deep Learning > Techniques > Fine-Tuning

Machine Learning > Learning Types > Large Language Models

Keywords

natural language understanding, auxiliary task, masked language modeling, next sentence prediction, BERT pre-training, pre-training task, sentence ordering, multi-task pre-training
