Multi-Scale Self-Attention for Text Classification

Qipeng Guo; Xipeng Qiu; Pengfei Liu; Xiangyang Xue; Zheng Zhang

AAAI 2020

Abstract

In this paper, we introduce a piece of prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Drawing on a linguistic perspective and an analysis of a Transformer pre-trained on a huge corpus (BERT), we further design a strategy to control the scale distribution for each layer. Results on three kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-size datasets.
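
The abstract does not say how a head's scale is realized, so the following is only a minimal sketch under a stated assumption: each head is restricted to a local window around the current position, and heads differ in window radius. The names `local_attention_head`, `multi_scale_attention`, and `radii` are hypothetical. The learned query/key/value projections of a real Transformer are omitted, and the per-layer scale-distribution strategy is not modeled.

```python
import math


def local_attention_head(x, radius):
    """One self-attention head where position i only attends to
    positions j with |i - j| <= radius (assumed meaning of 'scale')."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = range(lo, hi)
        # Scaled dot-product scores against the positions in the window.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in window]
        # Numerically stable softmax over the window.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the window's vectors (input reused as values).
        out.append([sum(w / total * x[j][c] for w, j in zip(weights, window))
                    for c in range(d)])
    return out


def multi_scale_attention(x, radii):
    """Run one head per radius and concatenate head outputs per position."""
    heads = [local_attention_head(x, r) for r in radii]
    return [sum((h[i] for h in heads), []) for i in range(len(x))]


# Toy sequence of 4 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
y = multi_scale_attention(x, radii=[0, 1, 3])
print(len(y), len(y[0]))  # 4 6
```

In this sketch a radius of 0 reduces a head to the identity, while a radius of at least the sequence length minus one recovers ordinary global attention. Mixing radii is one simple way to let different heads see different context sizes.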

Topics

Machine Learning > Core Methods > Representation Learning

Deep Learning > Architectures > Transformers

Natural Language Processing > Applications > Text Classification

Natural Language Processing > Resources & Methods > Large Language Models

Deep Learning > Techniques > Attention Mechanism

Keywords

text classification, attention mechanism, pre-trained transformer, multi-scale transformer, scale distribution, multi-scale self-attention

Related papers

Enhancing Pointer Network for Sentence Ordering with Pairwise Ordering Predictions 2020

CopyMTL: Copy Mechanism for Joint Extraction of Entities and Relations with Multi-Task Learning 2020

Neural Simile Recognition with Cyclic Multitask Learning and Local Attention 2020

Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy 2020

Multi-Point Semantic Representation for Intent Classification 2020