MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

Qizhen Weng; Wencong Xiao; Yinghao Yu; Wei Wang; Cheng Wang; Jian He; Yong Li; Liping Zhang; Wei LIN; Yu Ding

NSDI 2022

Abstract

With sustained advances in machine learning (ML) and the recent availability of massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) clouds, often with heterogeneous GPUs, to provision a host of ML applications. However, running diverse ML workloads in heterogeneous GPU clusters raises a number of challenges. In this paper, we present a characterization study of a two-month workload trace collected from a production MLaaS cluster with over 6,000 GPUs at Alibaba. We explain the challenges posed to cluster scheduling, including low GPU utilization, long queueing delays, the presence of hard-to-schedule tasks demanding high-end GPUs with picky scheduling requirements, imbalanced load across heterogeneous machines, and a potential bottleneck on CPUs. We describe our current solutions and call for further investigation into the challenges that remain open. We have released the trace, which is the most comprehensive in terms of workloads and cluster scale, for public access.

Topics

Machine Learning > Application Areas > Efficient Computing
Computer Science > Systems > Distributed Systems

Keywords

heterogeneous computing, GPU utilization, queueing delay, cluster scheduling, resource utilization, workload scheduling, GPU cluster, GPU scheduling, heterogeneous cluster, machine learning workload, machine learning service
