Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE

Kshiteej Mahajan; Ching-Hsiang Chu; Srinivas Sridharan; Aditya Akella

NSDI 2023

Abstract

Emerging ML training deployments are trending towards larger models and hybrid-parallel training that is dominated not just by compute-intensive all-reduce for gradient aggregation but also by bandwidth-intensive collectives (e.g., all-to-all). These emerging collectives exacerbate communication bottlenecks despite heterogeneous network interconnects with ample multipath opportunities. In this work, we propose SYNDICATE, a systematic, general framework to minimize communication bottlenecks and speed up training for both state-of-the-art and future large-scale models and interconnects. SYNDICATE proposes a novel abstraction, the motif, to break large communication work into smaller pieces as part of execution planning. SYNDICATE also jointly optimizes scheduling and execution planning by rethinking the interfaces in the networking systems stacks used for ML training. Motifs afford greater flexibility during scheduling, and the joint optimizer exploits this flexibility by packing and ordering communication work so as to maximize both network utilization and overlap with compute. This improves the speed of training state-of-the-art large models by 21-74%.

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Application Areas > Efficient Computing
Computer Science > Systems > Distributed Systems
Deep Learning > Optimization & Theory > Optimization

Keywords

distributed learning, collective scheduling, execution planning, communication bottleneck, hybrid-parallel training, network utilization
