HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices

Ruilong Ma; Xiang Yang; Jingyu Wang; Qi Qi; Haifeng Sun; Jing Wang; Zirui Zhuang; Jianxin Liao

NAACL 2024

Abstract

Micro-enterprises and individual developers increasingly need to analyze long sequences with powerful Large Language Models (LLMs). They try to deploy LLMs locally, but possess only assorted commodity devices connected by unreliable links. Existing parallel techniques are less effective in such limited environments. The heterogeneity of the devices, coupled with their limited capacity and expensive communication, makes it challenging for a private deployment to maximize utilization of the available devices while masking latency. Hence, we introduce HPipe, a pipeline inference framework that successfully migrates LLMs from high-performance clusters to heterogeneous commodity devices. By ensuring a balanced distribution of workloads, HPipe facilitates the parallel execution of LLMs by pipelining sequences along the token dimension. The evaluation conducted on LLaMA-7B and GPT3-2B demonstrates that HPipe holds potential for long-context analysis with LLMs on heterogeneous devices, achieving a speedup of up to 2.28 times in latency and throughput.
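
The idea of pipelining a sequence along the token dimension can be illustrated with a toy timing model. The sketch below is not HPipe's algorithm or cost model; it is a minimal illustration under stated assumptions: each pipeline stage (device) has a hypothetical fixed cost in seconds per token, compute time is linear in chunk length, and communication time is ignored. The function names `pipeline_makespan` and `split_tokens` are invented for this example.

```python
def split_tokens(total_tokens, n_chunks):
    """Split a sequence of total_tokens into n_chunks near-equal chunk sizes."""
    base, extra = divmod(total_tokens, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]


def pipeline_makespan(stage_costs, chunk_sizes):
    """Return the time at which the last chunk leaves the last stage.

    stage_costs[s]: assumed seconds per token on stage s (hypothetical values)
    chunk_sizes[c]: number of tokens in chunk c, processed in order
    """
    # finish[s] holds the time at which stage s finished its previous chunk.
    finish = [0.0] * len(stage_costs)
    for tokens in chunk_sizes:
        ready = 0.0  # time at which this chunk has left the previous stage
        for s, cost in enumerate(stage_costs):
            # A stage starts a chunk once the chunk has arrived and the
            # stage is done with the chunk before it.
            start = max(ready, finish[s])
            finish[s] = start + cost * tokens
            ready = finish[s]
    return finish[-1]
```

Under these assumptions, with two stages costing 1.0 and 2.0 seconds per token, a 4-token sequence sent as one chunk finishes at 12.0 seconds, while the same sequence sent as two 2-token chunks finishes at 10.0 seconds because the stages overlap. Balancing the two stages at 1.5 seconds per token each lowers the two-chunk finish time further to 9.0 seconds, which is the intuition behind pairing token-level pipelining with balanced workloads.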

Topics

Machine Learning > Optimization & Theory > Distributed Learning; Machine Learning > Application Areas > Efficient Computing

Keywords

distributed inference; pipeline parallelism; model deployment; heterogeneous computing; large language model
