Scaling Distributed Machine Learning with In-Network Aggregation

Amedeo Sapio; Marco Canini; Chen-Yu Ho; Jacob Nelson; Panos Kalnis; Changhoon Kim; Arvind Krishnamurthy; Masoud Moshref; Dan Ports; Peter Richtarik

NSDI 2021


Abstract

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5 times for a number of real-world benchmark models.
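The core idea in the abstract, combining the workers' model updates inside the network so that each worker receives one aggregated result, can be illustrated with a minimal sketch. This is not the SwitchML protocol itself: the function name `aggregate_in_network`, the use of plain integer lists, and the element-wise sum as the aggregation operation are all assumptions made for illustration only.

```python
from typing import List


def aggregate_in_network(updates: List[List[int]]) -> List[int]:
    """Element-wise sum of equally sized worker updates.

    Hypothetical stand-in for the aggregation step that the abstract
    places in the switch; the real dataplane logic is not shown here.
    """
    if not updates:
        raise ValueError("need at least one worker update")
    length = len(updates[0])
    if any(len(u) != length for u in updates):
        raise ValueError("all updates must have the same length")
    # zip(*updates) groups the i-th element of every worker's update.
    return [sum(values) for values in zip(*updates)]


# Three hypothetical workers, each holding a 3-element model update.
worker_updates = [
    [1, -2, 3],
    [4, 5, -6],
    [7, 0, 2],
]
aggregated = aggregate_in_network(worker_updates)
print(aggregated)  # [12, 3, -1]
```

In this toy setting each worker sends one update and receives one aggregated vector back, rather than exchanging updates with every other worker, which is the sense in which aggregation in the network reduces the volume of exchanged data.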



Topics

Machine Learning > Optimization & Theory > Distributed Learning

Keywords

distributed machine learning, model training, programmable switch, in-network aggregation, communication primitive

