NSDI 2021

When Cloud Storage Meets RDMA

Abstract

Pangu is a cloud storage system developed by Alibaba. Since its inception in 2009, it has served, and still serves, most of Alibaba's core businesses, e.g., e-business and online payment. A cloud storage system is expected to achieve high performance, high reliability, and high stability simultaneously. Recent rapid progress in storage media has made networking a major performance bottleneck for new generations of cloud storage. Remote Direct Memory Access (RDMA) running on lossless Ethernet is the most promising answer to the network bottleneck in cloud storage. In this paper, we share our experience of introducing RDMA into Pangu's storage networks. We design a fabric that takes performance, reliability, and stability into consideration together. For performance optimization, Pangu builds a software framework that integrates RDMA with its private storage protocol stack. To guarantee reliability, Pangu uses RDMA/TCP switching as a last resort. To improve stability, Pangu uses intensive monitoring and parameter tuning for fail-over. As of submission time, RDMA-enabled Pangu has successfully served many online mission-critical services for over three years, including several important shopping festivals.

👥 Mega-Team — 24 authors
🧭 Keyword Pioneer — storage network