Mask Attention Networks: Rethinking and Strengthen Transformer

Zhihao Fan; Yeyun Gong; Dayiheng Liu; Zhongyu Wei; Siyuan Wang; Jian Jiao; Nan Duan; Ruofei Zhang; Xuanjing Huang

NAACL 2021

Abstract

Transformer is an attention-based neural network that consists of two sublayers, namely, the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). Existing research seeks to enhance the two sublayers separately to improve the capability of Transformer for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, their static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer named dynamic mask attention network (DMAN) with a learnable mask matrix that is able to model localness adaptively. To incorporate the advantages of DMAN, SAN, and FFN, we propose a sequential layered structure to combine the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.
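The idea of attention with a static mask matrix can be illustrated with a small sketch. The abstract does not state the formula, so the code below rests on assumptions: the mask multiplies the exponentiated attention scores before normalization, an all-ones mask stands in for the SAN case, and an identity mask stands in for the token-wise (FFN-like) case, with the FFN's nonlinear transform left out. The band mask is a hand-picked static window meant only to show what local attention looks like; it is not the learnable DMAN mask. All function names are hypothetical.

```python
import math


def mask_attention(queries, keys, values, mask):
    """Attention where mask[i][j] scales the unnormalized weight of key j for query i.

    Every row of `mask` needs at least one nonzero entry; otherwise the
    normalizer below is zero.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Scaled dot-product score between query i and every key.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in keys]
        shift = max(scores)  # subtract the largest score before exp for stability
        weights = [m * math.exp(s - shift) for m, s in zip(mask[i], scores)]
        total = sum(weights)
        # Weighted average of the value vectors, one output column at a time.
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs


def all_ones_mask(n):
    """Assumed SAN-like mask: every token may attend to every token."""
    return [[1.0] * n for _ in range(n)]


def identity_mask(n):
    """Assumed FFN-like mask: every token attends only to itself."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def band_mask(n, width):
    """Illustrative static local mask: tokens within `width` positions are visible."""
    return [[1.0 if abs(i - j) <= width else 0.0 for j in range(n)] for i in range(n)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n = len(tokens)
global_out = mask_attention(tokens, tokens, tokens, all_ones_mask(n))
tokenwise_out = mask_attention(tokens, tokens, tokens, identity_mask(n))
local_out = mask_attention(tokens, tokens, tokens, band_mask(n, 1))
```

Under these assumptions the identity mask returns each token's own value vector, the all-ones mask reduces to ordinary softmax attention, and the band mask sits between the two by limiting each token to its neighbors.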

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Model Architecture

Keywords

neural machine translation, text summarization, feed-forward network, self-attention network, mask attention
