Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

Wenzheng Zhang; Yang Hu; Jing Shi; Xiaoying Bai

AAAI 2025

Abstract

Scaling Deep Neural Networks (DNNs) requires significant computational resources, in terms of both GPU quantity and compute capacity. In practice, a large number of heterogeneous GPU devices usually exist because of the rapid release cycle of GPU products. There is a strong need to harness heterogeneous GPUs efficiently and economically in order to meet the requirements of DNN research and development. The paper introduces Poplar, a distributed training system that extends the Zero Redundancy Optimizer (ZeRO) with heterogeneity-aware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity, and combinations of these. To achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and a search algorithm to optimize the utilization of heterogeneous GPU clusters. Furthermore, Poplar implements fully automated parallelism, eliminating the need to manually configure heterogeneous hardware and to find a suitable batch size. Extensive experiments on three heterogeneous clusters, comprising six different types of GPUs, demonstrate that Poplar improves training throughput by 1.02-3.92x over current state-of-the-art heterogeneous training systems.
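The abstract does not spell out the batch allocation method, so the following is only a minimal sketch of the general idea, not Poplar's actual algorithm. It assumes each GPU has already been profiled for throughput (samples per second) and for the largest batch that fits in its memory, then greedily hands out samples so that the slowest device's step time stays as low as possible. The names `GpuProfile` and `allocate_batches` and all numbers are hypothetical.

```python
from typing import Dict, List, NamedTuple


class GpuProfile(NamedTuple):
    """Hypothetical per-GPU measurements; not taken from the paper."""
    name: str
    throughput: float  # measured samples per second
    max_batch: int     # largest batch that fits in this GPU's memory


def allocate_batches(profiles: List[GpuProfile], global_batch: int) -> Dict[str, int]:
    """Split global_batch across GPUs, favouring faster devices.

    Each sample goes to the GPU whose estimated step time would be
    lowest after receiving it, skipping GPUs already at their memory cap.
    """
    if global_batch > sum(p.max_batch for p in profiles):
        raise ValueError("global batch exceeds the combined memory capacity")

    alloc = {p.name: 0 for p in profiles}
    for _ in range(global_batch):
        # Only GPUs with memory headroom can take another sample.
        candidates = [p for p in profiles if alloc[p.name] < p.max_batch]
        # Estimated step time if this GPU took one more sample.
        best = min(candidates, key=lambda p: (alloc[p.name] + 1) / p.throughput)
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    cluster = [
        GpuProfile("fast_gpu", throughput=300.0, max_batch=64),
        GpuProfile("slow_gpu", throughput=100.0, max_batch=64),
    ]
    print(allocate_batches(cluster, global_batch=40))
```

With these assumed numbers the faster GPU receives three times as many samples as the slower one, so both finish a step at roughly the same time. When a fast GPU has little memory, its cap is reached first and the remainder spills over to the other devices.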

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Model Architecture
Computer Science > Systems > Distributed Systems
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

model parallelism, distributed training, data parallelism, GPU cluster, heterogeneous GPUs, gradient optimizer, batch allocation
