Preparing Lessons for Progressive Training on Language Models

Yu Pan; Ye Yuan; Yichun Yin; Jiaxin Shi; Zenglin Xu; Ming Zhang; Lifeng Shang; Xin Jiang; Qun Liu

AAAI 2024

Abstract

The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.
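The abstract names two mechanisms without giving formulas: sampling training depths with priority on low values, and extending depth by interpolation rather than stacking. The sketch below is only an illustration of those two ideas under stated assumptions; it is not the authors' implementation. The geometric weighting, the `decay` parameter, and the function names `lvps_weights`, `sample_depth`, and `interpolate_layers` are all hypothetical, and the layers are plain strings standing in for real Transformer blocks.

```python
import random


def lvps_weights(max_depth, decay=0.5):
    """Return sampling probabilities for depths 1..max_depth.

    Assumption: a geometric decay, so smaller depths are more likely.
    The paper's actual distribution is not given in the abstract.
    """
    raw = [decay ** (d - 1) for d in range(1, max_depth + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_depth(max_depth, rng, decay=0.5):
    """Sample one training depth, favoring low values."""
    depths = list(range(1, max_depth + 1))
    weights = lvps_weights(max_depth, decay)
    return rng.choices(depths, weights=weights, k=1)[0]


def interpolate_layers(layers, target_depth):
    """Extend a layer list to target_depth by repeating each layer in place.

    Each source layer lands next to its own copies (a, a, b, b, ...),
    in contrast to stacking, which appends a copy of the whole list
    (a, b, a, b, ...). Returning the same objects mimics weight sharing.
    """
    n = len(layers)
    return [layers[i * n // target_depth] for i in range(target_depth)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(lvps_weights(4))
    print(sample_depth(4, rng))
    print(interpolate_layers(["a", "b", "c"], 6))
```

Because `interpolate_layers` returns references to the original entries, a deeper model built this way reuses the shallow model's parameters until they are explicitly copied, which is one plausible reading of how weight sharing could support expansion.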

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Model Architecture
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Efficient Computing
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Optimization & Theory > Efficient Computing
Deep Learning > Models > Language Models

Keywords

transformer architecture, transfer learning, neural network optimization, efficient computing, weight sharing, progressive training, model efficiency, resource consumption, layer stacking, model extension
