NSDI 2025

Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation

Abstract

The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, with companies deploying clusters with tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of different parallelization strategies, communication parameters, congestion control, fabric topology, etc. Current methods require up to 10k simulation experiments to identify optimal configurations, with inadequate exploration leading to significant degradation of training performance. In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME's efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse leverages the computing throughput of GPUs efficiently with optimizations such as pull-based synchronization, high-fidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy with real-world LLM training on clusters of up to 54,000 GPUs and achieving a 43.1-73.2× speedup over state-of-the-art CPU-based simulators in various use cases.
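
As a rough illustration of the single-process multi-experiment idea, the Python sketch below enumerates a small, hypothetical design space and advances every experiment in lockstep inside one process. The axis names, their values, and the toy transfer-time model are assumptions made for illustration; none of them are taken from Multiverse, and the sketch does not model GPU execution.

```python
from itertools import product

# Hypothetical design-space axes; names and values are illustrative only.
DESIGN_SPACE = {
    "tensor_parallel": [1, 2, 4, 8],
    "pipeline_parallel": [1, 2, 4],
    "link_gbps": [100, 200, 400],
}


def enumerate_experiments(space):
    """Return one config dict per point in the Cartesian product of the axes."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def run_spme(configs, message_bytes=1_000_000, steps=10):
    """Advance all experiments in lockstep inside a single process.

    Toy model (an assumption): at each step every experiment sends one message,
    and the transfer takes message_bytes * 8 / (link_gbps * 1e9) seconds.
    """
    elapsed = [0.0] * len(configs)
    for _ in range(steps):
        # One shared loop serves every experiment, so no per-experiment
        # process has to be scheduled by the operating system.
        for i, cfg in enumerate(configs):
            elapsed[i] += message_bytes * 8 / (cfg["link_gbps"] * 1e9)
    return elapsed


configs = enumerate_experiments(DESIGN_SPACE)
times = run_spme(configs)
best = min(range(len(configs)), key=times.__getitem__)
print(len(configs), configs[best], times[best])
```

Even this tiny three-axis space yields 36 experiments, which hints at how quickly realistic design spaces reach the thousands of runs mentioned above.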

🌉 Interdisciplinary Bridge: Artificial Intelligence and Machine Learning
🧭 Keyword Pioneer: gpu-based simulator
🐝 Cross-Pollinator: Artificial Intelligence, Deep Learning, Machine Learning, Reinforcement Learning