FastSeq: Make Sequence Generation Faster

Yu Yan; Fei Hu; Jiusheng Chen; Nikhil Bhendawade; Ting Ye; Yeyun Gong; Nan Duan; Desheng Cui; Bingyu Chi; Ruofei Zhang

ACL 2021

Abstract

Transformer-based models have had a tremendous impact on natural language generation. However, inference speed remains a bottleneck because of large model sizes and the intensive computation involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to apply to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use, requiring only a one-line code change. The source code is available at https://github.com/microsoft/fastseq.
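To make the repeated n-gram detection step concrete, the following is a minimal CPU-side sketch of the usual "no repeated n-gram" check used during decoding: given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared. This is an illustration only and not FastSeq's implementation; the function name `banned_next_tokens` and its signature are hypothetical, and the abstract does not describe the details of the efficient algorithm itself.

```python
from typing import Dict, List, Set, Tuple


def banned_next_tokens(tokens: List[int], n: int) -> Set[int]:
    """Return the tokens that would repeat an n-gram already in `tokens`.

    Hypothetical helper for illustration; not part of the FastSeq API.
    """
    if n <= 0 or len(tokens) < n:
        # No complete n-gram exists yet, so nothing can be repeated.
        return set()

    prefix_len = n - 1
    # Map every (n-1)-token prefix to the set of tokens that followed it.
    followers: Dict[Tuple[int, ...], Set[int]] = {}
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + prefix_len])
        followers.setdefault(prefix, set()).add(tokens[i + prefix_len])

    # The last n-1 generated tokens form the prefix of the next n-gram.
    current = tuple(tokens[len(tokens) - prefix_len:])
    return followers.get(current, set())


if __name__ == "__main__":
    # With n = 3, the sequence ends in (1, 2), which was followed by 3 before.
    print(banned_next_tokens([1, 2, 3, 1, 2], n=3))
```

In a decoder, the returned token ids would typically have their scores set to negative infinity before the next token is selected. This sketch rescans the whole history at every step for each hypothesis, which is the kind of per-step cost that a more efficient detection algorithm aims to avoid.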

Topics

Machine Learning > Optimization & Theory > Optimization; Machine Learning > Application Areas > Efficient Computing; Natural Language Processing > Generation > Text Generation; Deep Learning > Techniques > Transfer Learning; Deep Learning > Optimization & Theory > Efficient Computing

Keywords

sequence generation; parallel processing; inference optimization; model acceleration; transformer inference; auto-regressive decoding; attention cache optimization
