AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Zhuohan Li; Lianmin Zheng; Yinmin Zhong; Vincent Liu; Ying Sheng; Xin Jin; Yanping Huang; Zhifeng Chen; Hao Zhang; Joseph E. Gonzalez; Ion Stoica

OSDI 2023

Abstract

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests.
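The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not AlpaServe's placement algorithm; it assumes a hypothetical setup of two GPUs, a burst of requests that all target one model at time 0, and a made-up 10% pipeline overhead. The function names and all numbers are illustrative assumptions, not values from the paper.

```python
def dedicated_latencies(n_requests, service_time):
    """Model pinned to one GPU: requests run one after another.

    Assumption: the second GPU hosts a different model and sits idle
    during the burst, so it cannot help.
    """
    return [(i + 1) * service_time for i in range(n_requests)]


def pipelined_latencies(n_requests, service_time, n_stages=2, overhead=0.1):
    """Model split into equal pipeline stages, one stage per GPU.

    Assumption: each stage costs its share of the work plus a
    fractional overhead (hypothetical value, default 10%).
    """
    stage_time = service_time / n_stages * (1 + overhead)
    # Request i leaves the last stage after n_stages slots to traverse
    # the pipeline plus i slots spent waiting behind earlier requests.
    return [(n_stages + i) * stage_time for i in range(n_requests)]


if __name__ == "__main__":
    burst = 4
    dedicated = dedicated_latencies(burst, service_time=1.0)
    pipelined = pipelined_latencies(burst, service_time=1.0)
    print("dedicated:", dedicated)
    print("pipelined:", pipelined)
    print("mean dedicated:", sum(dedicated) / burst)
    print("mean pipelined:", sum(pipelined) / burst)
```

Under these assumptions a single request is slower with pipelining because of the overhead, while the last request of the burst finishes sooner because both GPUs contribute to the same model. This is the tension between parallelism overhead and statistical multiplexing that the abstract refers to.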

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Model Architecture

Keywords

model parallelism, statistical multiplexing, serving latency, bursty workload, distributed cluster
