Scaling Instruction-Finetuned Language Models

Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma; Albert Webson; Shixiang Shane Gu; Zhuyun Dai; Mirac Suzgun; Xinyun Chen; Aakanksha Chowdhery; Alex Castro-Ros; Marie Pellat; Kevin Robinson; Dasha Valter; Sharan Narang; Gaurav Mishra; Adams Yu; Vincent Zhao; Yanping Huang; Andrew Dai; Hongkun Yu; Slav Petrov; Ed H. Chi; Jeff Dean; Jacob Devlin; Adam Roberts; Denny Zhou; Quoc V. Le; Jason Wei

2024 JMLR JMLR 2024

Scaling Instruction-Finetuned Language Models

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks (at time of release), such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models. [abs] [ pdf ][ bib ] © JMLR 2024. (edit, beta)

👥 Mega-Team — 35 authors

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Hyung Won Chung , Le Hou , Shayne Longpre , Barret Zoph , Yi Tay , William Fedus , Yunxuan Li , Xuezhi Wang , Mostafa Dehghani , Siddhartha Brahma , Albert Webson , Shixiang Shane Gu , Zhuyun Dai , Mirac Suzgun , Xinyun Chen , Aakanksha Chowdhery , Alex Castro-Ros , Marie Pellat , Kevin Robinson , Dasha Valter , Sharan Narang , Gaurav Mishra , Adams Yu , Vincent Zhao , Yanping Huang , Andrew Dai , Hongkun Yu , Slav Petrov , Ed H. Chi , Jeff Dean , Jacob Devlin , Adam Roberts , Denny Zhou , Quoc V. Le , Jason Wei

Topics

Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Learning Types > Zero-Shot Learning

Keywords

few-shot learning language model model scaling instruction finetuning

Download PDF

Related papers

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks 2024

Convergence for nonconvex ADMM, with applications to CT imaging 2024

Functional Directed Acyclic Graphs 2024

Sum-of-norms clustering does not separate nearby balls 2024

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning 2024