Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation

Fei Gui; Kaihui Gao; Li Chen; Dan Li; Vincent Liu; Ran Zhang; Hongbing Yang; Dian Xiong

NSDI 2025

Abstract

The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, with companies deploying clusters of tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of parallelization strategies, communication parameters, congestion control, fabric topology, and more. Current methods require up to 10k simulation experiments to identify optimal configurations, and inadequate exploration leads to significant degradation of training performance. In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME’s efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse leverages the computing throughput of GPUs efficiently with optimizations such as pull-based synchronization, high-fidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy with real-world LLM training on clusters of up to 54,000 GPUs and achieving a 43.1-73.2x speedup over state-of-the-art CPU-based simulators in various use cases.

Authors

Fei Gui, Kaihui Gao, Li Chen, Dan Li, Vincent Liu, Ran Zhang, Hongbing Yang, Dian Xiong

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Application Areas > Efficient Computing

Keywords

design space exploration, parallel simulation, gpu-based simulator, llm training system, multi-experiment parallelism

Related papers

Building an Elastic Block Storage over EBOFs Using Shadow Views 2025

Unlocking ECMP Programmability for Precise Traffic Control 2025

Scaling IP Lookup to Large Databases using the CRAM Lens 2025

AsTree: An Audio Subscription Architecture Enabling Massive-Scale Multi-Party Conferencing 2025

Eden: Developer-Friendly Application-Integrated Far Memory 2025