ATP: In-network Aggregation for Multi-tenant Learning

ChonLam Lao; Yanfang Le; Kshiteej Mahajan; Yixi Chen; Wenfei Wu; Aditya Akella; Michael Swift

NSDI 2021

Abstract

Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes and aggregates gradients. Recent advances in hardware accelerators have shifted the performance bottleneck of training from computation to communication. To speed up DT jobs' communication, we propose ATP, a service for in-network aggregation aimed at modern multi-rack, multi-job DT settings. ATP uses emerging programmable switch hardware to support in-network aggregation at multiple rack switches in a cluster. ATP performs decentralized, dynamic, best-effort aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources. ATP outperforms existing systems, accelerating training throughput by up to 38%-66% in a cluster shared by multiple DT jobs.
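
As a rough illustration of what best-effort aggregation under limited switch resources could look like, the toy Python sketch below sums gradient chunks from several jobs using a fixed pool of switch slots. It is not ATP's actual protocol: the function name `aggregate_round`, the slot model, the packet arrival order, and the end-host fallback are all assumptions made for this example. A chunk that finds no free slot on first arrival is simply aggregated at an end host instead, so the result is the same either way.

```python
from collections import defaultdict


def aggregate_round(gradients, num_slots):
    """Sum per-job gradients using at most `num_slots` switch slots (toy model).

    gradients: {job_name: [worker_0_gradient, worker_1_gradient, ...]}
    Returns (sums, number_of_chunks_aggregated_in_the_switch).
    """
    switch_slots = {}   # (job, chunk) -> [running_sum, packets_seen]
    bypassed = set()    # chunks that found no free slot on first arrival
    host_partial = defaultdict(lambda: [0.0, 0])  # hypothetical end-host fallback
    sums = {job: [None] * len(ws[0]) for job, ws in gradients.items()}
    in_switch = 0

    # Assumed arrival order: worker 0 of every job sends first, then worker 1, ...
    max_workers = max(len(ws) for ws in gradients.values())
    for worker in range(max_workers):
        for job, ws in gradients.items():
            if worker >= len(ws):
                continue
            for chunk, value in enumerate(ws[worker]):
                key = (job, chunk)
                if key not in switch_slots and key not in bypassed:
                    if len(switch_slots) < num_slots:
                        switch_slots[key] = [0.0, 0]   # claim a free slot
                    else:
                        bypassed.add(key)              # best effort: fall back
                state = switch_slots[key] if key in switch_slots else host_partial[key]
                state[0] += value
                state[1] += 1
                if state[1] == len(ws):                # all workers contributed
                    sums[job][chunk] = state[0]
                    if key in switch_slots:
                        del switch_slots[key]          # release slot for other jobs
                        in_switch += 1
    return sums, in_switch


jobs = {"A": [[1.0, 2.0], [3.0, 4.0]], "B": [[10.0, 20.0], [30.0, 40.0]]}
print(aggregate_round(jobs, num_slots=2))
```

With two slots and this arrival order, job A's two chunks are summed in the toy switch while job B's fall back to the end host; the per-job sums are identical regardless of how many slots are available.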

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Neural Networks

Keywords

neural network training, gradient aggregation, distributed learning, programmable switch, in-network aggregation
