DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

Chuanyang Zheng; Yihang Gao; Han Shi; Jing Xiong; Jiankai Sun; Jingyao Li; Minbin Huang; Xiaozhe Ren; Michael Ng; Xin Jiang; Zhenguo Li; Yu Li

ACL 2025


Abstract

The attention mechanism is a fundamental component of the Transformer model, governing the interactions among distinct tokens. In general, the attention scores are determined simply by the key-query products. However, an exploratory experiment in this work (combining DAPE and NoPE), which adds MLPs on top of the attention scores without position encoding, indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (over neighboring attention scores across different heads) to mimic the processing methods used in computer vision. Specifically, **the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem**. We call the resulting method Convolutional Data-Adaptive Position Encoding (CDAPE). This insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
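To make the idea in the abstract concrete, the following is a minimal NumPy sketch of treating the pre-softmax attention scores as a multi-channel feature map, with heads as channels, and refining them with a small convolution over neighboring key positions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 1 x k kernel shape, the residual connection, and the omission of any positional bias term are all choices made here for brevity.

```python
import numpy as np


def causal_softmax(logits):
    """Row-wise softmax over keys with a causal (lower-triangular) mask."""
    T = logits.shape[-1]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    # Every row keeps at least its diagonal entry, so the max is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def conv_over_scores(scores, kernel):
    """Convolve a (heads, T, T) score tensor as an H-channel feature map.

    Assumption: `kernel` has shape (H_out, H_in, k) and acts as a 1 x k
    convolution along the key axis that also mixes heads. Left-padding
    means output key j only sees keys j-(k-1)..j.
    """
    H, T, _ = scores.shape
    H_out, H_in, k = kernel.shape
    assert H_in == H, "kernel input channels must match the number of heads"
    # Zero out future (masked) positions so they cannot leak into the conv.
    x = scores * np.tril(np.ones((T, T)))
    padded = np.pad(x, ((0, 0), (0, 0), (k - 1, 0)))
    out = np.zeros((H_out, T, T))
    for offset in range(k):
        window = padded[:, :, offset:offset + T]  # keys shifted by k-1-offset
        out += np.einsum("oh,hqj->oqj", kernel[:, :, offset], window)
    return out


def attention_with_processed_scores(q, k, v, kernel):
    """Causal attention whose logits are refined by a conv over the score map.

    q, k, v: arrays of shape (heads, T, d). The residual form
    `scores + conv(scores)` is an assumption of this sketch and requires
    H_out == H in `kernel`.
    """
    d = q.shape[-1]
    scores = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(d)
    processed = scores + conv_over_scores(scores, kernel)
    weights = causal_softmax(processed)
    return np.einsum("hqk,hkd->hqd", weights, v), weights
```

With an all-zero kernel the function reduces to ordinary causal dot-product attention, which is a convenient sanity check. In a trained model the kernel would be learned along with the other parameters.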



Topics

Deep Learning > Architectures > Transformers; Deep Learning > Techniques > Model Architecture; Computer Science > Applications > Computer Graphics; Deep Learning > Optimization & Theory > Efficient Computing

Keywords

transformer architecture; attention mechanism; feature map; length extrapolation; position encoding; convolutional feature map


