Global Capacity Management With Flux
Abstract
Customers of both private and public cloud providers must wrestle with the problem of regionalization: how should service capacity be apportioned across a large number of geo-distributed datacenter regions? This problem is further complicated by the complex service dependency graphs that arise from microservice architectures, and by capacity availability and hardware mix that can vary greatly by region. Historically, regionalization has been handled through a slow-moving, manual process in which owners of large services negotiate capacity allocation and distribution directly with the cloud provider. However, as both service and cloud footprints continue to grow, these manual processes are becoming untenable: they create a great deal of toil for everyone involved and tend to produce suboptimal results. At Meta, we have built a system, Flux, to automate capacity regionalization, moving it from a bottom-up, manual process to a top-down, automated one. Flux employs RPC tracing to identify service capacity models, and uses these models to compute an optimal joint capacity and traffic distribution plan that spans thousands of services across tens of products and involves millions of servers. These plans are orchestrated by a system that safely and efficiently rebalances service capacity and product traffic across tens of regions on a continuous basis.