2019
NIPS
NeurIPS 2019
Are Sixteen Heads Really Better than One?
Abstract
Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond the simple weighted average. However we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis on machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder layers are more dependent on multi-headedness.
❓
The Questioner
🧭
Keyword Pioneer
— self-attention layer
🐣
Hot Topic Early Bird
— attention head
🐝
Cross-Pollinator
— Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio