Gandiva: Introspective Cluster Scheduling for Deep Learning

Wencong Xiao; Romil Bhardwaj; Ramachandran Ramjee; Muthian Sivathanu; Nipun Kwatra; Zhenhua Han; Pratyush Patel; Xuan Peng; Hanyu Zhao; Quanlu Zhang; Fan Yang; Lidong Zhou

OSDI 2018

Abstract

We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve the latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve best-fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, since these jobs perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieve better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning.

Authors

Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, Lidong Zhou

Topics

Machine Learning > Application Areas > Efficient Computing

Computer Science > Systems > Distributed Systems

Keywords

deep learning; cluster scheduling; GPU scheduling; job migration; hyperparameter search
