Guiding Attention for Self-Supervised Learning with Transformers

Ameet Deshpande; Karthik Narasimhan

2020 EMNLP EMNLP 2020

Guiding Attention for Self-Supervised Learning with Transformers

Abstract

AbstractIn this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function to guide attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence of models as well as better performance on downstream tasks compared to the baselines, achieving state of the art results in low-resource settings. Surprisingly, we also find that linguistic properties of attention heads are not necessarily correlated with language modeling performance.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ameet Deshpande , Karthik Narasimhan

Topics

Machine Learning > Learning Types > Contrastive Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Architectures > Transformers Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Techniques > Attention Mechanism

Keywords

self-supervised learning attention mechanism language modeling downstream task bidirectional transformer auxiliary loss

Download PDF

Related papers

Fast semantic parsing with well-typedness guarantees 2020

Detecting Objectifying Language in Online Professor Reviews 2020

Analogous Process Structure Induction for Sub-event Sequence Prediction 2020

Aspect Sentiment Classification with Aspect-Specific Opinion Spans 2020

Robust and Interpretable Grounding of Spatial References with Relation Networks 2020