ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing

Mike Chow; Yang Wang; William Wang; Ayichew Hailu; Rohan Bopardikar; Bin Zhang; Jialiang Qu; David Meisner; Santosh Sonawane; Yunqi Zhang; Rodrigo Paim; Mack Ward; Ivor Huang; Matt McNally; Daniel Hodges; Zoltan Farkas; Caner Gocmen; Elvis Huang; Chunqiang Tang

Awarded Best Paper!

OSDI 2024

Abstract

This paper presents ServiceLab, a large-scale performance testing platform developed at Meta. Currently, the diverse set of applications and ML models it tests consumes millions of machines in production, and each year it detects performance regressions that could otherwise lead to the wastage of millions of machines. A major challenge for ServiceLab is to detect small performance regressions, sometimes as tiny as 0.01%. These minor regressions matter due to our large fleet size and their potential to accumulate over time. For instance, the median regression detected by ServiceLab for our large serverless platform, running on more than half a million machines, is only 0.14%. Another challenge is running performance tests in our private cloud, which, like the public cloud, is a noisy environment that exhibits inherent performance variances even for machines of the same instance type. To address these challenges, we conduct a large-scale study with millions of performance experiments to identify machine factors, such as the kernel, CPU, and datacenter location, that introduce variance to test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Finally, we share our seven years of operational experience in dealing with a diverse set of applications.
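The fleet-size argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the paper's method: the function names are hypothetical, the 1% run-to-run noise level is an assumed figure, and the sample-size estimate uses the standard two-sample normal approximation rather than whatever analysis ServiceLab actually applies.

```python
import math
from statistics import NormalDist


def machines_at_stake(fleet_size: int, regression_pct: float) -> float:
    """Machines' worth of capacity consumed by a regression of `regression_pct` percent."""
    return fleet_size * regression_pct / 100.0


def samples_per_arm(effect_pct: float, noise_pct: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough number of measurements needed in each of the baseline and candidate arms.

    Textbook two-sample normal approximation:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
    where sigma is the per-measurement noise and delta the effect to detect,
    both expressed as a percentage of the mean. This is an assumption for
    illustration, not the statistical method described in the paper.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * noise_pct / effect_pct) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Figures from the abstract: a 0.14% median regression on a platform of
    # more than half a million machines (500,000 is used as a lower bound).
    print(machines_at_stake(500_000, 0.14))

    # Assumed 1% measurement noise (hypothetical), same 0.14% effect.
    print(samples_per_arm(effect_pct=0.14, noise_pct=1.0))
```

Because the required sample count grows with the square of the noise-to-effect ratio, halving the measurement noise cuts the number of runs needed by roughly a factor of four, which is one way to see why identifying the machine factors that introduce variance matters.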

Topics

Machine Learning > Optimization & Theory > Statistical Learning

Machine Learning > Application Areas > Efficient Computing

Mathematics & Optimization > Optimization > Stochastic Methods

Keywords

statistical analysis, distributed system, cloud computing, performance regression detection, machine allocation

Related papers

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models 2024

SquirrelFS: using the Rust compiler to check file-system crash consistency 2024

μSlope: High Compression and Fast Search on Semi-Structured Logs 2024

Using Dynamically Layered Definite Releases for Verifying the RefFS File System 2024

DSig: Breaking the Barrier of Signatures in Data Centers 2024