Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Haoran Xu; Maha Elbayad; Kenton Murray; Jean Maillard; Vedanuj Goswami

EMNLP 2023

Abstract

Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on three multilingual machine translation benchmarks, containing 4, 15, and 94 language pairs, respectively. We show that SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
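To make the sparse-activation idea in the abstract concrete, the following is a minimal sketch of a generic top-1 routed MoE feed-forward layer in plain Python. It is not the SMoE architecture and not the authors' implementation: the function names (`make_expert`, `moe_forward`, `count_params`), the ReLU experts, the softmax gate, and the toy dimensions are all assumptions chosen for illustration. It only shows how the expert parameter count can grow with the number of experts while each token still runs through a single expert.

```python
import math
import random

def make_expert(d_model, d_hidden, rng):
    """One feed-forward expert (hypothetical): two weight matrices as nested lists."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out

def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m) gives a vector of length m."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def expert_forward(x, expert):
    """Apply one expert: linear layer, ReLU, linear layer."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate, experts):
    """Top-1 routing (an assumed, generic scheme): only the best-scoring expert runs."""
    probs = softmax(matvec(x, gate))
    chosen = max(range(len(experts)), key=lambda e: probs[e])
    y = expert_forward(x, experts[chosen])
    # Scale the expert output by its gate probability.
    return [probs[chosen] * v for v in y], chosen

def count_params(experts):
    """Total number of weights across all experts (gate excluded)."""
    return sum(len(w) * len(w[0]) for expert in experts for w in expert)

rng = random.Random(0)
d_model, d_hidden = 4, 8  # toy sizes, assumed for illustration
token = [rng.gauss(0, 1) for _ in range(d_model)]
for n_experts in (2, 8):
    experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
    gate = [[rng.gauss(0, 0.1) for _ in range(n_experts)] for _ in range(d_model)]
    out, chosen = moe_forward(token, gate, experts)
    print(n_experts, "experts:", count_params(experts), "expert parameters, 1 expert executed")
```

In this sketch every expert has the same size, which is the equal-capacity setting the abstract hypothesizes to be a source of parameter inefficiency; how SMoE stratifies experts and assigns dynamic capacity is described in the paper itself and is not reproduced here.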

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Model Architecture
Natural Language Processing > Applications > Machine Translation
Machine Learning > Application Areas > Model Compression
Natural Language Processing > Generation > Machine Translation
Deep Learning > Optimization & Theory > Efficient Computing
Machine Learning > Learning Types > Multi-Objective Optimization

Keywords

transformer architecture, machine translation, multilingual translation, multilingual machine translation, mixture of experts, sparse activation, parameter efficiency, multilingual model, dynamic capacity
