The Fine-Grained Complexity of Gradient Computation for Training Large Language Models

Josh Alman; Zhao Song

NeurIPS 2024

Abstract

Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one needs to alternately run "forward" computations and "backward" computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. In previous work by [Alman and Song, NeurIPS 2023], it was proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis $\mathsf{SETH}$ is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
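To make the two computations concrete, here is a minimal NumPy sketch of the naive quadratic-time baseline that the abstract's results are measured against. It is not the almost-linear-time algorithm from the paper. The parameterization is an assumption made for illustration: separate matrices `Q`, `K`, `V`, the conventional `1/sqrt(d)` scaling, and a squared-error loss against a hypothetical target matrix `E`. The paper's own formulation may differ in these details.

```python
import numpy as np

def attention_forward(Q, K, V):
    """Naive softmax attention; materializes the full n x n matrix."""
    d = Q.shape[1]
    S = Q @ K.T / np.sqrt(d)                # n x n attention scores
    S = S - S.max(axis=1, keepdims=True)    # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)    # row-normalize (softmax)
    return P, P @ V

def loss(Q, K, V, E):
    """Assumed loss: 0.5 * ||Attention(Q, K, V) - E||_F^2."""
    _, out = attention_forward(Q, K, V)
    return 0.5 * np.sum((out - E) ** 2)

def loss_gradients(Q, K, V, E):
    """Backward pass by the chain rule; also costs O(n^2 d) time."""
    d = Q.shape[1]
    P, out = attention_forward(Q, K, V)
    G = out - E                             # dL/d(out), n x d
    dV = P.T @ G
    dP = G @ V.T                            # dL/dP, n x n
    # Softmax Jacobian applied row by row.
    dS = P * (dP - np.sum(dP * P, axis=1, keepdims=True))
    dQ = dS @ K / np.sqrt(d)
    dK = dS.T @ Q / np.sqrt(d)
    return dQ, dK, dV

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences of the scalar f() with respect to X."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        old = X[idx]
        X[idx] = old + eps
        f_plus = f()
        X[idx] = old - eps
        f_minus = f()
        X[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d = 6, 3
Q, K, V, E = (rng.standard_normal((n, d)) for _ in range(4))
dQ, dK, dV = loss_gradients(Q, K, V, E)
dQ_num = numerical_grad(lambda: loss(Q, K, V, E), Q)
print(np.allclose(dQ, dQ_num, atol=1e-5))
```

Both passes build the explicit n x n matrix `P`, which is the source of the quadratic cost. The finite-difference check at the end only confirms that the analytic gradient in this sketch matches its own loss.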

Topics

Machine Learning > Optimization & Theory > Optimization
Machine Learning > Optimization & Theory > Theory
Natural Language Processing > Resources & Methods > Large Language Models
Mathematics & Optimization > Optimization > Optimization
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Optimization & Theory > Theory

Keywords

attention mechanism, neural network optimization, computational complexity, theoretical analysis, fine-grained complexity, computational efficiency, training optimization, gradient computation, complexity bound, attention network, large language model, large language model training, attention function, sub-quadratic algorithm
