Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics

Greg Yang; Etai Littwin

2021 ICML ICML 2021

Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics

Abstract

Yang (2020) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show the same neural networks (in the so-called NTK parametrization) during training follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: Write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs, which we believe is also an important contribution toward the pedagogy and exposition of the Tensor Programs technique.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — kernel gradient descent

🐣 Hot Topic Early Bird — training dynamics

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Greg Yang , Etai Littwin

Topics

Machine Learning > Optimization & Theory > Learning Theory Machine Learning > Optimization & Theory > Theory Deep Learning > Architectures > Neural Networks Deep Learning > Optimization & Theory > Neural Network Optimization Deep Learning > Optimization & Theory > Theory

Keywords

neural tangent kernel function space training dynamics gradient descent dynamics infinite width limit kernel gradient descent infinite-width limit neural network architectural universality

Download PDF

Related papers

GRAND: Graph Neural Diffusion 2021

Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits 2021

Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation 2021

Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution 2021

Dataset Dynamics via Gradient Flows in Probability Space 2021