Are Sixteen Heads Really Better than One?

Paul Michel; Omer Levy; Graham Neubig

NeurIPS 2019

Abstract

Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond the simple weighted average. However, we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis on machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder layers are more dependent on multi-headedness.
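To make the idea of removing heads at test time concrete, here is a minimal sketch of multi-headed attention with a per-head on/off mask. It rests on several assumptions that the abstract does not spell out: a head is "removed" by multiplying its output by 0, the learned input and output projections are omitted, and the names `attention_head`, `masked_multi_head_attention`, and `head_mask` are hypothetical. It should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for a single head.

    Each output vector is a weighted average of the value vectors,
    with weights given by the softmax of query-key dot products.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def masked_multi_head_attention(heads, head_mask):
    """Concatenate per-head outputs, zeroing every head whose mask entry is 0.

    heads: list of (queries, keys, values) triples, one per head
           (assumed layout; learned projections are left out).
    head_mask: list of 0/1 flags, one per head (assumed masking scheme).
    """
    if len(heads) != len(head_mask):
        raise ValueError("head_mask needs exactly one entry per head")
    per_head = [
        [[m * x for x in row] for row in attention_head(q, k, v)]
        for (q, k, v), m in zip(heads, head_mask)
    ]
    n_queries = len(per_head[0])
    # Concatenate the heads along the feature dimension for each query.
    return [sum((head[i] for head in per_head), []) for i in range(n_queries)]


# Toy example: two heads, one query position each.
heads = [
    ([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]], [[5.0, 6.0], [7.0, 8.0]]),
]
full = masked_multi_head_attention(heads, [1, 1])    # both heads active
pruned = masked_multi_head_attention(heads, [1, 0])  # second head removed
```

In this toy setup, `pruned` matches `full` on the first head's features and is zero on the second head's features. Measuring how much a downstream metric changes between the two configurations is one way to judge whether a head matters.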

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Model Architecture

Keywords

attention head, encoder-decoder attention, self-attention layer, head pruning, multi-headed attention
