Adaptive Attention Span in Transformers

Sainbayar Sukhbaatar; Edouard Grave; Piotr Bojanowski; Armand Joulin

ACL 2019

Abstract

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we achieve state-of-the-art performance on text8 and enwik8 using a maximum context of 8k characters.
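
As an illustration only, since the abstract does not spell out the mechanism: one way to make a span learnable is to multiply the attention weights by a soft mask over key distance that stays at 1 up to the span and then ramps linearly down to 0, so the span can be trained by gradient descent. The sketch below assumes that form. The names `soft_span_mask`, `masked_attention`, `span`, and `ramp` are hypothetical and are not taken from this page.

```python
import math


def soft_span_mask(distance, span, ramp):
    """Soft mask in [0, 1] for a key `distance` steps back.

    Assumed form: 1 up to `span`, then a linear ramp of width `ramp` down to 0.
    `ramp` must be positive.
    """
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def masked_attention(scores, span, ramp):
    """Softmax over past positions, re-weighted by the soft span mask.

    scores[d] is the raw attention score for the key d steps back
    (d = 0 is the most recent). Assumes span >= 0 so that at least
    one key keeps a non-zero mask.
    """
    peak = max(scores)  # subtract the max for numerical stability
    weighted = [
        soft_span_mask(d, span, ramp) * math.exp(s - peak)
        for d, s in enumerate(scores)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Uniform scores over 8 past positions, span of 4, ramp of 2.
    weights = masked_attention([0.0] * 8, span=4, ramp=2)
    print([round(w, 3) for w in weights])
```

Keys whose mask is exactly 0 never need to be stored or scored, which is how a short learned span would translate into the memory and compute savings the abstract mentions.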

Authors

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

Topics

Deep Learning > Architectures > Transformers
Natural Language Processing > Generation > Language Modeling
Deep Learning > Learning Types > Representation Learning
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

transformer architecture, self-attention mechanism, language modeling, memory footprint, context size, adaptive attention span, character level language modeling, attention span

Related papers

What do phone embeddings learn about Phonology? 2019

Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages 2019

Understanding Undesirable Word Embedding Associations 2019

Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text 2019

Domain Adaptation of Neural Machine Translation by Lexicon Induction 2019