BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Qizhen Zhang; Nikolas Gritsch; Dwaraknath Gnaneshwar; Simon Guo; David Cairuz; Bharat Venkitesh; Jakob Foerster; Phil Blunsom; Sebastian Ruder; Ahmet Üstün; Acyr Locatelli

NeurIPS 2024

Abstract

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance compared to dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Previous work addresses this challenge by independently training multiple dense expert models and using them to initialize an MoE. In particular, state-of-the-art approaches initialize the MoE layers from the experts' feed-forward parameters while merging all other parameters, which limits the benefit of the specialized dense models once they are upcycled into an MoE. We propose BAM (Branch-Attend-Mix), a simple yet effective improvement to MoE training. BAM makes full use of the specialized dense models: it not only uses their feed-forward networks (FFNs) to initialize the MoE layers but also uses their attention weights to initialize mixture-of-attention (MoA) layers. We explore two methods for upcycling MoA layers: 1) initializing separate attention experts from the dense models, including the key, value, and query matrices; and 2) initializing only the query (Q) projections per expert while sharing key-value pairs across all experts to facilitate efficient inference. Our experiments using seed models ranging from 590 million to 2 billion parameters show that our approach outperforms state-of-the-art approaches under the same data and compute budget in both perplexity and downstream task evaluations, confirming the effectiveness of BAM.
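The two MoA upcycling variants described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the parameter names (`ffn`, `q`, `k`, `v`, `embed`), the flat-list representation of weights, and the `upcycle` function are all hypothetical. The abstract also does not say how the shared key-value parameters or the merged non-expert parameters are initialized, so element-wise averaging is used here purely as an assumption.

```python
from statistics import fmean


def average(tensors):
    """Element-wise mean of equally sized flat parameter lists."""
    return [fmean(vals) for vals in zip(*tensors)]


def upcycle(dense_models, share_kv=False):
    """Build a toy MoE parameter dict from dense checkpoints.

    dense_models: list of dicts mapping hypothetical names
    ("ffn", "q", "k", "v", "embed") to flat lists of floats.
    share_kv=False: each attention expert keeps its own Q, K, V (variant 1).
    share_kv=True: only Q is per expert; K and V are shared (variant 2).
    """
    moe = {"experts": [], "shared": {}}
    for model in dense_models:
        # FFN and query weights are always copied per expert.
        expert = {"ffn": list(model["ffn"]), "q": list(model["q"])}
        if not share_kv:
            expert["k"] = list(model["k"])
            expert["v"] = list(model["v"])
        moe["experts"].append(expert)
    if share_kv:
        # Assumption: shared K/V start from the average of the dense models.
        for name in ("k", "v"):
            moe["shared"][name] = average([m[name] for m in dense_models])
    # Assumption: remaining parameters are merged by simple averaging.
    moe["shared"]["embed"] = average([m["embed"] for m in dense_models])
    return moe
```

Calling `upcycle(models)` gives every expert its own key and value weights, while `upcycle(models, share_kv=True)` stores a single `k` and `v` under `shared`, which is the property that would reduce key-value cache size at inference time.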

Topics

Artificial Intelligence > Core AI > Model Compression
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks
Deep Learning > Models > Generative Models
Machine Learning > Application Areas > Model Compression
Machine Learning > Core Methods > Model Compression
Machine Learning > Learning Types > Knowledge Distillation
Deep Learning > Learning Types > Knowledge Distillation
Deep Learning > Learning Types > Model Compression

Keywords

model compression, attention mechanism, knowledge distillation, efficient inference, perplexity evaluation, mixture of expert, feed-forward network, parameter upcycling, attention weight, dense model, model initialization, large language model
