EL-Attention: Memory Efficient Lossless Attention for Generation

Yu Yan; Jiusheng Chen; Weizhen Qi; Nikhil Bhendawade; Yeyun Gong; Nan Duan; Ruofei Zhang

ICML 2021

Abstract

Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache introduces new memory-related costs and prevents the use of larger batch sizes for faster inference. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations needed to build multi-head keys and values, so no cache is needed for them. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
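The equivalence claimed in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, and the bias-free projections are assumptions made for illustration. The idea shown is that the per-head key projection can be folded into the query (expanding it to the model dimension) and the per-head value projection can be applied after the attention-weighted sum, so the hidden states `h` are the only key/value tensor shared by all heads.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, h, w_q, w_k, w_v):
    # x: (n_q, d_model) queries; h: (n_k, d_model) hidden states.
    # w_q, w_k, w_v: (heads, d_model, d_head), bias-free (assumption).
    d_head = w_q.shape[-1]
    q = np.einsum("qd,hde->hqe", x, w_q)
    k = np.einsum("kd,hde->hke", h, w_k)  # per-head keys
    v = np.einsum("kd,hde->hke", h, w_v)  # per-head values
    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(d_head)
    return np.einsum("hqk,hke->hqe", softmax(scores), v)


def el_attention(x, h, w_q, w_k, w_v):
    # Same inputs, but no per-head keys or values are ever built.
    d_head = w_q.shape[-1]
    q = np.einsum("qd,hde->hqe", x, w_q)
    # Expand the query: (x W_q) W_k^T has shape (heads, n_q, d_model).
    q_expanded = np.einsum("hqe,hde->hqd", q, w_k)
    # Score directly against the shared hidden states.
    scores = np.einsum("hqd,kd->hqk", q_expanded, h) / np.sqrt(d_head)
    # Weighted sum over shared hidden states, then project per head.
    context = np.einsum("hqk,kd->hqd", softmax(scores), h)
    return np.einsum("hqd,hde->hqe", context, w_v)


rng = np.random.default_rng(0)
heads, d_model, d_head = 4, 32, 8
x = rng.standard_normal((3, d_model))
h = rng.standard_normal((10, d_model))
w_q, w_k, w_v = (0.1 * rng.standard_normal((heads, d_model, d_head))
                 for _ in range(3))

mha_out = multi_head_attention(x, h, w_q, w_k, w_v)
el_out = el_attention(x, h, w_q, w_k, w_v)
assert np.allclose(mha_out, el_out)
```

In this sketch the two functions agree up to floating-point error because matrix multiplication is associative; the difference is only in which intermediate tensors exist. The sketch does not model incremental decoding, beam search, or the memory and speed measurements reported in the abstract.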

Authors

Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, Ruofei Zhang

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Natural Language Processing > Generation > Text Generation
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

attention mechanism, text generation, memory efficiency, inference speed, multi-head attention, transformer model
