Global Capacity Management With Flux

Marius Eriksen; Kaushik Veeraraghavan; Yusuf Abdulghani; Andrew Birchall; Po-Yen Chou; Richard Cornew; Adela Kabiljo; Ranjith Kumar S; Maroo Lieuw; Justin Meza; Scott Michelson; Thomas Rohloff; Hayley Russell; Jeff Qin; Chunqiang Tang

OSDI 2023

Abstract

Customers of both private and public cloud providers must wrestle with the problem of regionalization: how should service capacity be apportioned across a large number of geo-distributed datacenter regions? This problem is further complicated by the complex service dependency graphs that arise from microservice architectures, and by capacity availability and hardware mix that can vary greatly by region. Historically, regionalization has been handled through a slow-moving, manual process in which owners of large services negotiate capacity allocation and distribution directly with the cloud provider. However, as both service and cloud footprints continue to grow, these manual processes are becoming untenable: they produce a great deal of toil for everyone involved and tend to yield suboptimal results. At Meta we have built a system, Flux, to automate capacity regionalization, moving it from a bottom-up, manual process to a top-down, automated one. Flux employs RPC tracing to identify service capacity models, and uses these models to compute an optimal joint capacity and traffic distribution plan that spans thousands of services across tens of products and involves millions of servers. These plans are orchestrated by a system that safely and efficiently rebalances service capacity and product traffic across tens of regions on a continuous basis.

Topics

Computer Science > Systems > Distributed Systems; Computer Science > Applications > Software Engineering

Keywords

distributed system; capacity management; service optimization; resource orchestration; traffic distribution
