Instruction Pre-Training: Language Models are Supervised Multitask Learners

Daixuan Cheng; Yuxian Gu; Shaohan Huang; Junyu Bi; Minlie Huang; Furu Wei

EMNLP 2024

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
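The augmentation step described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `InstructionPair`, `augment_text`, `build_corpus`, and `toy_synthesizer` are hypothetical, the "Question:/Answer:" template is an assumed format, and the toy synthesizer is a stand-in for the LM-based instruction synthesizer that the paper builds on open-source models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    """One synthesized instruction-response pair (hypothetical container)."""
    instruction: str
    response: str


def augment_text(raw_text: str, pairs: List[InstructionPair]) -> str:
    """Append instruction-response pairs to a raw text, giving one training example.

    The "Question:/Answer:" template is an assumption made for illustration.
    """
    parts = [raw_text.strip()]
    for pair in pairs:
        parts.append(f"Question: {pair.instruction}\nAnswer: {pair.response}")
    return "\n\n".join(parts)


def build_corpus(
    raw_texts: List[str],
    synthesize: Callable[[str], List[InstructionPair]],
) -> List[str]:
    """Augment every raw text with the pairs produced by `synthesize`."""
    return [augment_text(text, synthesize(text)) for text in raw_texts]


def toy_synthesizer(text: str) -> List[InstructionPair]:
    """Placeholder synthesizer; a real one would be a fine-tuned language model."""
    words = text.split()
    first_word = words[0] if words else ""
    return [InstructionPair("What is the first word of the text?", first_word)]


if __name__ == "__main__":
    corpus = build_corpus(["Paris is the capital of France."], toy_synthesizer)
    print(corpus[0])
```

Passing the synthesizer in as a callable keeps the augmentation logic independent of whichever model generates the pairs, so the toy function could be swapped for a real synthesizer without touching `build_corpus`.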

Topics

Machine Learning > Learning Types > Semi-Supervised Learning
Deep Learning > Techniques > Pretraining
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Multi-Task Learning
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Task Learning

Keywords

supervised learning, multitask learning, instruction tuning, language model, continual pre-training, large language model, multitask pre-training, instruction pre-training, supervised multitask learning
