Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Mingze Wang; Weinan E

NeurIPS 2024

Abstract

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
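To make the components named in the abstract concrete, the following is a minimal sketch of one Transformer layer in plain Python: positional encoding, multi-head dot-product self-attention, and a token-wise feed-forward step. It is an illustration only, not the construction analysed in the paper. The function names, the sinusoidal encoding, the identity query/key/value maps, the single ReLU standing in for the feed-forward layer, and the absence of a causal mask are all assumptions made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_encoding(position, dim):
    """Sinusoidal positional encoding (an assumed choice, for illustration)."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def attention_head(xs, lo, hi):
    """Dot-product self-attention restricted to coordinates lo..hi-1.

    Assumption: query, key and value maps are the identity on this slice;
    a trained model would use learned projection matrices instead.
    """
    scale = math.sqrt(hi - lo)
    out = []
    for q in xs:
        scores = [
            sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / scale for k in xs
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, xs)) for j in range(lo, hi)]
        )
    return out


def feed_forward(x):
    """Token-wise ReLU, a placeholder for a learned two-layer network."""
    return [max(0.0, v) for v in x]


def transformer_layer(tokens, num_heads):
    """One layer: add positions, attend with num_heads heads, then feed-forward."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    # Inject order information by adding the positional encoding to each token.
    xs = [
        [t + p for t, p in zip(tok, sinusoidal_encoding(pos, dim))]
        for pos, tok in enumerate(tokens)
    ]
    # Each head attends over its own slice of the embedding coordinates.
    heads = [
        attention_head(xs, h * head_dim, (h + 1) * head_dim)
        for h in range(num_heads)
    ]
    out = []
    for i, x in enumerate(xs):
        attended = [v for head in heads for v in head[i]]  # concatenate heads
        residual = [a + b for a, b in zip(x, attended)]    # residual connection
        out.append(feed_forward(residual))
    return out


if __name__ == "__main__":
    sequence = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    for row in transformer_layer(sequence, num_heads=2):
        print([round(v, 3) for v in row])
```

Stacking calls to `transformer_layer` varies the number of layers, and `num_heads` varies the number of attention heads, which are the two parameters the abstract singles out.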

Authors

Mingze Wang, Weinan E

Topics

Machine Learning > Optimization & Theory > Learning Theory
Machine Learning > Optimization & Theory > Theory
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Model Architecture
Machine Learning > Learning Types > Representation Learning
Deep Learning > Optimization & Theory > Neural Network Optimization
Deep Learning > Optimization & Theory > Theory

Keywords

sequence modeling, attention mechanism, theoretical analysis, expressive power, approximation theory, positional encoding, long-range dependency, feed-forward layer, approximation property, transformer expressive power
