On Modular Learning of Distributed Systems for Predicting End-to-End Latency

Chieh-Jan Mike Liang; Zilin Fang; Yuqing Xie; Fan Yang; Zhao Lucis Li; Li Lyna Zhang; Mao Yang; Lidong Zhou

NSDI 2023

Abstract

An emerging trend in cloud deployments is to adopt machine learning (ML) models to characterize end-to-end system performance. Despite early success, such methods can incur significant costs when adapting to the deployment dynamics of distributed systems, such as service scale-out and replacement. They require hours or even days of data collection and model training; otherwise, the models may drift and become unacceptably inaccurate. This problem arises from the practice of modeling the entire system with monolithic models. We propose Fluxion, a framework that models end-to-end system latency with modularized learning. Fluxion introduces the learning assignment, a new abstraction that allows individual sub-components to be modeled separately. Because learning assignments share a consistent interface, they can be dynamically composed into an inference graph that models a complex distributed system on the fly. A change in a system sub-component only requires updating the corresponding learning assignment, which significantly reduces costs. Using three systems with up to 142 microservices on a 100-VM cluster, Fluxion shows a performance modeling MAE (mean absolute error) up to 68.41% lower than that of monolithic models. In turn, this lower MAE allows better system performance tuning, e.g., a speedup of up to 1.57× in 90th-percentile end-to-end latency. All of this is achieved under various system deployment dynamics.
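The composition idea in the abstract can be illustrated with a minimal sketch. This is not Fluxion's actual API: the names `LearningAssignment`, `InferenceGraph`, `put`, and `depends_on` are hypothetical, the per-component models are hand-written formulas standing in for trained regressors, and the dependency graph is assumed to be acyclic. The sketch shows only that swapping one component's model changes the end-to-end prediction without touching the other models, and how MAE would be computed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LearningAssignment:
    """Hypothetical per-component latency model (not Fluxion's real interface)."""
    name: str
    # model(config, dependency_latencies) -> predicted latency in ms
    model: Callable[[Dict[str, float], List[float]], float]
    depends_on: List[str] = field(default_factory=list)


class InferenceGraph:
    """Composes assignments by feeding dependency predictions into each model."""

    def __init__(self) -> None:
        self.assignments: Dict[str, LearningAssignment] = {}

    def put(self, assignment: LearningAssignment) -> None:
        # Adding and replacing are the same operation: only this node changes.
        self.assignments[assignment.name] = assignment

    def predict(self, name: str, config: Dict[str, float]) -> float:
        # Recursive evaluation; assumes the dependency graph has no cycles.
        node = self.assignments[name]
        dep_latencies = [self.predict(dep, config) for dep in node.depends_on]
        return node.model(config, dep_latencies)


def mean_absolute_error(predicted: List[float], observed: List[float]) -> float:
    """MAE = mean of |predicted - observed| over all samples."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Toy models with made-up coefficients, standing in for trained regressors.
graph = InferenceGraph()
graph.put(LearningAssignment("db", lambda cfg, deps: 2.0 + 0.5 * cfg["db_load"]))
graph.put(LearningAssignment("cache", lambda cfg, deps: 1.0))
graph.put(LearningAssignment("frontend", lambda cfg, deps: 3.0 + sum(deps),
                             depends_on=["db", "cache"]))

config = {"db_load": 4.0}
before = graph.predict("frontend", config)

# The database service is replaced: only its assignment is updated.
graph.put(LearningAssignment("db", lambda cfg, deps: 1.0 + 0.25 * cfg["db_load"]))
after = graph.predict("frontend", config)

# Observed latencies here are invented purely to exercise the MAE helper.
error = mean_absolute_error([before, after], [7.0, 6.5])
```

In this toy setup the front-end simply adds its own cost to the sum of its dependencies' latencies; a real learning assignment would learn that relationship from data rather than hard-code it.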

Topics

Machine Learning > Core Methods > Regression
Machine Learning > Application Areas > Efficient Computing

Keywords

distributed system, performance prediction, modular learning, mean absolute error, end-to-end latency
