Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators

Jie Zhao; Siyuan Feng; Xiaoqiang Dan; Fei Liu; Chengke Wang; Sheng Yuan; Wenyuan Lv; Qikai Xie

OSDI 2023

Abstract

Fully exploiting the computing power of an accelerator specialized for deep neural networks (DNNs) calls for synergy between network and hardware architectures. Existing approaches, however, partition a DNN's computational graph into multiple sub-graphs by abstracting away the hardware architecture and then assign resources to each sub-graph, which both produces redundant off-core data movements and under-utilizes the hardware resources of a domain-specific architecture (DSA). This paper introduces a systematic approach for effectively scheduling DNN computational graphs on DSA platforms. By fully taking hardware architecture into account when partitioning a computational graph into coarse-grained sub-graphs, our work enables this synergy and addresses several challenges of prior work: (1) it produces larger but fewer kernels, converting a large number of off-core data movements into on-core data exchanges; (2) it exploits the imbalanced distribution of memory usage across a DNN's architecture, better saturating the DSA memory hierarchy; (3) it enables across-layer instruction scheduling, not studied before, further exploiting the parallelism across different specialized compute units. Results for seven DNN inference models on a DSA platform show that our work outperforms TVM and AStitch by 11.15× and 6.16×, respectively, and obtains throughput competitive with the vendor-crafted implementation. A case study on GPU also demonstrates that generating kernels for our sub-graphs can surpass CUTLASS with and without convolution fusion by 1.06× and 1.23×, respectively.
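To illustrate point (1) of the abstract, the following is a minimal toy sketch and not the paper's partitioning algorithm. It assumes a simplified cost model in which every graph edge that crosses a kernel boundary counts as one off-core data movement, and every edge inside a kernel counts as one on-core exchange. The graph, the operator names, the two partitions, and the helper `count_data_movements` are all hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (producer op, consumer op)


def count_data_movements(edges: List[Edge], kernel_of: Dict[str, int]) -> Tuple[int, int]:
    """Return (off_core, on_core) edge counts under a toy cost model.

    Assumption: a tensor passed between two ops in different kernels must
    leave the core; a tensor passed within one kernel stays on-core.
    """
    off_core = sum(1 for src, dst in edges if kernel_of[src] != kernel_of[dst])
    on_core = len(edges) - off_core
    return off_core, on_core


# Hypothetical five-operator graph with one skip connection.
edges = [
    ("conv1", "relu1"),
    ("relu1", "conv2"),
    ("conv2", "relu2"),
    ("relu2", "add"),
    ("conv1", "add"),
]

# Fine-grained partition: one kernel per operator.
fine = {op: i for i, op in enumerate(["conv1", "relu1", "conv2", "relu2", "add"])}

# Coarse-grained partition: two larger kernels.
coarse = {"conv1": 0, "relu1": 0, "conv2": 0, "relu2": 1, "add": 1}

fine_counts = count_data_movements(edges, fine)
coarse_counts = count_data_movements(edges, coarse)
print("fine  (off-core, on-core):", fine_counts)
print("coarse (off-core, on-core):", coarse_counts)
```

In this toy setting, merging operators into fewer, larger kernels turns most boundary-crossing edges into internal ones. A real scheduler would also have to respect on-core memory capacity, which this sketch ignores.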

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Model Architecture

Keywords

kernel optimization; deep neural network; domain-specific accelerator; computational graph scheduling; memory hierarchy optimization
