Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Goro Kobayashi; Tatsuki Kuribayashi; Sho Yokoi; Kentaro Inui

EMNLP 2021

Abstract

The Transformer architecture has become ubiquitous in natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed solely of multi-head attention; other components can also contribute to Transformers’ performance. In this study, we extend the scope of the analysis of Transformers from the attention patterns alone to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect performance. The code for our experiments is publicly available.
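As a rough illustration of the three components the abstract groups under "attention block", the NumPy sketch below combines multi-head attention, a residual connection, and layer normalization. It is not the authors' released code. It assumes the post-layer-normalization ordering LayerNorm(MultiHeadAttention(x) + x) and uses random weights; the names `attention_block` and `init_params` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def layer_norm(x, gamma, beta, eps=1e-12):
    """Normalize each token vector to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def init_params(d_model, seed=0):
    """Random stand-in weights (hypothetical; a real model would load trained ones)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("q", "k", "v", "o"):
        params["W_" + name] = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        params["b_" + name] = np.zeros(d_model)
    params["gamma"] = np.ones(d_model)
    params["beta"] = np.zeros(d_model)
    return params


def attention_block(x, params, n_heads):
    """Assumed post-LN attention block: LayerNorm(MultiHeadAttention(x) + x).

    x has shape (n_tokens, d_model). Returns the block output, the multi-head
    attention output before the residual sum, and the attention weights.
    """
    n_tokens, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ params["W_q"] + params["b_q"])
    k = split_heads(x @ params["W_k"] + params["b_k"])
    v = split_heads(x @ params["W_v"] + params["b_v"])

    # Token-to-token interaction: each row of `weights` is one attention pattern.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Merge heads back to (n_tokens, d_model) and apply the output projection.
    context = (weights @ v).transpose(1, 0, 2).reshape(n_tokens, d_model)
    attn_out = context @ params["W_o"] + params["b_o"]

    # Residual connection, then layer normalization.
    out = layer_norm(attn_out + x, params["gamma"], params["beta"])
    return out, attn_out, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, hidden size 8
out, attn_out, weights = attention_block(x, init_params(8, seed=1), n_heads=2)
print("norm of attention output per token:", np.linalg.norm(attn_out, axis=-1))
print("norm of residual input per token:  ", np.linalg.norm(x, axis=-1))
```

The two printed norms are only a crude way to see that the vector passed to layer normalization is a sum of an attention term and a residual term. With random weights the numbers say nothing about trained masked language models.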

Topics

Artificial Intelligence > Core AI > Interpretability

Deep Learning > Architectures > Transformers

Deep Learning > Techniques > Model Architecture

Natural Language Processing > Resources & Methods > Language Modeling

Deep Learning > Techniques > Attention

Keywords

transformer architecture, attention mechanism, masked language model, residual connection, layer normalization, attention pattern
