Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

Heming Xia; Tao Ge; Peiyi Wang; Si-Qing Chen; Furu Wei; Zhifang Sui

EMNLP 2023

Abstract

We propose Speculative Decoding (SpecDec), the first formal study of exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for efficiently verifying the drafted tokens in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach achieves around a 5x speedup for the popular Transformer architectures with generation quality comparable to beam search decoding, challenging the impression that the draft-then-verify paradigm yields only a 1.4x to 2x speedup. In addition to the remarkable speedup, we demonstrate 3 further advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.
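The draft-then-verify paradigm named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the Spec-Verification criterion, so this sketch assumes the strictest possible rule (a drafted token is kept only if it equals the target model's greedy choice). The names `draft_then_verify`, `toy_target`, and `toy_drafter` are hypothetical, and both "models" are toy functions that map a token prefix to the next token.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a "model" maps a token prefix to its next token.
NextToken = Callable[[List[Token]], Token]


def draft_then_verify(target: NextToken, drafter: NextToken,
                      prefix: List[Token], k: int, max_new: int,
                      eos: Token) -> List[Token]:
    """Greedy draft-then-verify loop (assumed exact-match verification)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prefix)
    limit = len(prefix) + max_new
    while len(out) < limit:
        # 1) Draft: the cheap model proposes k tokens in a row.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2) Verify: compare each drafted token with the target's choice.
        #    A real Transformer would score all k positions in one parallel
        #    pass; this toy loop queries the target one position at a time.
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        # 3) Commit the accepted tokens, stopping at EOS or the length limit.
        for tok in accepted:
            out.append(tok)
            if tok == eos or len(out) == limit:
                return out
    return out


# Toy models (assumptions for illustration only).
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10  # counts upward modulo 10


def toy_drafter(seq: List[Token]) -> Token:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

Under this exact-match assumption the output is identical to greedy decoding with the target model alone; any speedup comes from how many drafted tokens are accepted per verification step. The abstract's claim of quality comparable to beam search refers to the authors' own drafter and verification method, which this sketch does not reproduce.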

Topics

Machine Learning > Optimization & Theory > Optimization
Deep Learning > Models > Generative Models
Natural Language Processing > Generation > Text Generation
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Models > Transformers
Deep Learning > Optimization & Theory > Model Compression
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

transformer architecture, text generation, generative model, inference optimization, speculative decoding, token verification, model acceleration, beam search, autoregressive decoding, sequence-to-sequence model, seq2seq generation
