OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoff; Luca Soldaini; Dirk Groeneveld; Kyle Lo; Jacob Morrison; Sewon Min; Weijia Shi; Evan Pete Walsh; Oyvind Tafjord; Nathan Lambert; Yuling Gu; Shane Arora; Akshita Bhagia; Dustin Schwenk; David Wadden; Alexander Wettig; Binyuan Hui; Tim Dettmers; Douwe Kiela; Ali Farhadi; Noah A. Smith; Pang Wei Koh; Amanpreet Singh; Hannaneh Hajishirzi

ICLR 2025

Abstract

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present novel findings on MoE training, define and analyze new routing properties showing high specialization in our model, and open-source all our work: model weights, training data, code, and logs.
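To illustrate how a sparse MoE layer can hold many parameters while using only a fraction of them per token, here is a minimal sketch of top-k expert routing. It is not the OLMoE implementation: the expert count, the value of k, the toy scaling experts, and the function names (`route_top_k`, `moe_forward`) are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k most probable experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

def moe_forward(x, router_logits, experts, k):
    """Run only the k selected experts and sum their outputs, weighted by router probability."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # experts that are not selected are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy setup: 4 experts, each scaling its input by a constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # assumed router scores for one token

selected = route_top_k(router_logits, k=2)
output = moe_forward([1.0, 1.0], router_logits, experts, k=2)
```

With k=2 of 4 experts, only half of the expert parameters take part in the computation for this token, which is the same idea behind the gap between total and active parameters described above. This sketch leaves the selected weights unnormalized; renormalizing them to sum to 1 is another common choice.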
