History of Neural Network Research since 2006

[On-going notes.]

Invention of pre-training

2006: Hinton & Salakhutdinov “Reducing the Dimensionality of Data with Neural Networks”

2007: Bengio et al. “Greedy layer-wise training of deep networks”

Pre-training addressed what was then seen as the main obstacle to training deep networks: greedy unsupervised pre-training of one layer at a time provided an initialization from which gradient-based fine-tuning worked well.

2010: Glorot & Bengio “Understanding the difficulty of training deep feedforward neural networks”

2nd-order methods

2010: Martens “Deep Learning via Hessian-Free optimization”

  • showed that HF “is capable of training DNNs from certain random initializations without the use of pre-training, and can achieve lower errors for the various auto-encoding tasks considered by Hinton & Salakhutdinov” (Sutskever et al., 2013)

maybe SGD wasn’t that bad for training deep nets?

  • Notably, Chapelle & Erhan (2011) used the random initialization of Glorot & Bengio (2010) and SGD to train the 11-layer autoencoder of Hinton & Salakhutdinov (2006), and were able to surpass the results reported by Hinton & Salakhutdinov (2006). While these results still fall short of those reported in Martens (2010) for the same tasks, they indicate that learning deep networks is not nearly as hard as was previously believed.
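For reference, a minimal NumPy sketch of the Glorot & Bengio (2010) “normalized” initialization used in that experiment (my own illustration, not code from any of the papers):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Glorot & Bengio (2010) 'normalized' initialization:
    W ~ U[-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)),
    chosen so activation and gradient variances stay roughly
    constant from layer to layer."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# e.g. the weight matrix of a 784 -> 256 layer
W = glorot_uniform(784, 256)
```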

Dropout

2012: Hinton et al. [“Improving neural networks by preventing co-adaptation of feature detectors”](http://arxiv.org/abs/1207.0580)
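As a reminder of the mechanism (my sketch, not code from the paper): each unit is dropped independently with some probability during training. This uses the “inverted” scaling convention, where activations are rescaled at training time; the paper instead halves the outgoing weights at test time, which is equivalent in expectation.

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Zero each unit independently with probability p_drop during
    training; scale the survivors by 1/(1 - p_drop) so the expected
    activation matches the test-time (no-dropout) forward pass."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop  # True = unit kept
    return h * mask / (1.0 - p_drop)
```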

schedule for the momentum parameter

2013: Sutskever, Martens, Dahl & Hinton “On the importance of initialization and momentum in deep learning”

  • when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization
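A sketch of the classical momentum update together with a slowly increasing momentum schedule of the kind the paper describes (the exact constants below are from memory and should be treated as illustrative):

```python
import numpy as np

def momentum_schedule(t, mu_max=0.99):
    """Slowly increase the momentum coefficient toward 1 over the
    course of training, capped at mu_max:
    mu_t = min(1 - 2**(-1 - log2(floor(t/250) + 1)), mu_max)
    i.e. 0.5 at t=0, 0.75 at t=250, 0.875 at t=750, ..."""
    return min(1 - 2 ** (-1 - np.log2(np.floor(t / 250) + 1)), mu_max)

def momentum_step(w, v, grad, lr, mu):
    """One step of classical momentum SGD (the paper also studies
    the Nesterov variant, which evaluates grad at w + mu*v)."""
    v = mu * v - lr * grad(w)
    return w + v, v
```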

Interesting remark:

  • Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

-> what is the reasoning behind this?

  • (in the stochastic setting, the final phase of the optimization problem resembles an estimation one)

  • One explanation is that previous theoretical analyses and practical benchmarking focused on local convergence in the stochastic setting, which is more of an estimation problem than an optimization one (Bottou & LeCun, 2004). In deep learning problems this final phase of learning is not nearly as long or important as the initial “transient phase” (Darken & Moody, 1993), where a better argument can be made for the beneficial effects of momentum.

what does this estimation-optimization thing mean? (Likely: asymptotically, stochastic learning is limited by the statistical error of estimating the minimizer from noisy samples rather than by curvature, so a better optimizer cannot help much in that final phase; momentum’s payoff is in the earlier transient phase.)