Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio; Pascal Lamblin; Dan Popovici; Hugo Larochelle

2006 NIPS NeurIPS 2006

Greedy Layer-Wise Training of Deep Networks

Abstract

Recent analyses (Bengio, Delalleau, & Le Roux, 2006; Bengio & Le Cun, 2007) of modern nonparametric machine learning algorithms that are kernel machines, such as Support Vector Machines (SVMs), graph-based manifold and semi-supervised learning algorithms suggest fundamental limitations of some learning algorithms. The problem is clear in kernel-based approaches when the kernel is "local" (e.g., the Gaussian kernel), i.e., K (x, y ) converges to a constant when ||x - y || increases. These analyses point to the difficulty of learning "highly-varying functions", i.e., functions that have a large number of "variations" in the domain of interest, e.g., they would require a large number of pieces to be well represented by a piecewise-linear approximation. Since the number of pieces can be made to grow exponentially with the number of factors of variations in the input, this is connected with the well-known curse of dimensionality for classical non-parametric learning algorithms (for regression, classification and density estimation). If the shapes of all these pieces are unrelated, one needs enough examples for each piece in order to generalize properly. However, if these shapes are related and can be predicted from each other, "non-local" learning algorithms have the potential to generalize to pieces not covered by the training set. Such ability would seem necessary for learning in complex domains such as Artificial Intelligence tasks (e.g., related to vision, language, speech, robotics). Kernel machines (not only those with a local kernel) have a shallow architecture, i.e., only two levels of data-dependent computational elements. This is also true of feedforward neural networks with a single hidden layer (which can become SVMs when the number of hidden units becomes large (Bengio, Le Roux, Vincent, Delalleau, & Marcotte, 2006)). A serious problem with shallow architectures is that they can be very inefficient in terms of the number of computational units (e.g., bases,

🚀 Conference Pioneer — NIPS 2006

🌱 Topic Pioneer — Model Architecture

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

📈 Trend Setter — Neural Network Optimization

🧭 Keyword Pioneer — greedy training

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

🐣 Hot Topic Early Bird — feature learning

Authors

Yoshua Bengio , Pascal Lamblin , Dan Popovici , Hugo Larochelle

Topics

Machine Learning > Learning Types > Unsupervised Learning Machine Learning > Optimization & Theory > Neural Network Optimization Deep Learning > Architectures > Neural Networks Deep Learning > Techniques > Model Architecture Machine Learning > Learning Types > Deep Learning Deep Learning > Optimization & Theory > Optimization Deep Learning > Learning Types > Representation Learning

Keywords

feature learning manifold learning greedy training unsupervised pretraining greedy layer-wise training feature hierarchy deep neural network kernel machine deep network

Download PDF

Related papers

Temporal Coding using the Response Properties of Spiking Neurons 2006

Parameter Expanded Variational Bayesian Methods 2006

Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning 2006

Ordinal Regression by Extended Binary Classification 2006

Blind source separation for over-determined delayed mixtures 2006