Jump Self-attention: Capturing High-order Statistics in Transformers

Haoyi Zhou; Siyang Xiao; Shanghang Zhang; Jieqi Peng; Shuai Zhang; Jianxin Li

NeurIPS 2022

Abstract

The recent success of the Transformer has benefited many real-world applications, owing to its ability to build long-range dependencies through pairwise dot-products. However, the strong assumption that elements attend directly to each other limits performance on tasks with high-order dependencies, such as natural language understanding and image captioning. To address this, we are the first to define Jump Self-attention (JAT) for building Transformers. Inspired by the way pieces move in English draughts, we introduce a spectral convolutional technique to calculate JAT on the dot-product feature map. This technique allows JAT to propagate within each self-attention head and is interchangeable with canonical self-attention. We further develop higher-order variants under the multi-hop assumption to increase generality. Moreover, the proposed architecture is compatible with pre-trained models. With extensive experiments, we empirically show that our methods significantly increase performance on ten different tasks.
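To make the contrast between direct and multi-hop attention concrete, the following is a minimal NumPy sketch. It is not the paper's spectral-convolution formulation of JAT, which the abstract does not specify. It only illustrates the underlying idea under a simplifying assumption: if the pairwise dot-product attention map is read as a weighted graph over tokens, its matrix powers describe dependencies that pass through intermediate tokens. The function names `attention_weights` and `multi_hop_weights` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    # Canonical pairwise scaled dot-product attention map (one head).
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)

def multi_hop_weights(A, hops):
    # Illustrative assumption, not the paper's method: treat A as a
    # row-stochastic adjacency matrix and take its matrix power, so that
    # entry (i, j) aggregates all paths of length `hops` from i to j.
    return np.linalg.matrix_power(A, hops)

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
X = rng.standard_normal((n_tokens, d_model))

A1 = attention_weights(X, X)   # direct (one-hop) attention
A2 = multi_hop_weights(A1, 2)  # two-hop attention through an intermediate token
```

Because `A1` is row-stochastic, `A2` is too, so each row can still be used as a set of mixing weights over value vectors.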

Authors

Haoyi Zhou, Siyang Xiao, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li

Topics

Computer Vision > Generation > Image Captioning
Natural Language Processing > Understanding > Semantic Analysis

Keywords

natural language understanding, spectral convolution, high-order dependencies
