Abstract
In this talk, I will present our recent work on the convergence and generalization analysis of popular optimizers in deep learning. (1) We establish the convergence of Adam under the (L0, L1) smoothness condition and argue that Adam can adapt to the local smoothness of the objective while SGD cannot. (2) We study the implicit regularization of deep learning optimizers. For adaptive optimizers, we prove that the convergent direction of RMSProp is the same as that of GD, while that of AdaGrad depends on the conditioner; for momentum acceleration, we prove that gradient descent with momentum converges to the L2 max-margin solution, the same as vanilla gradient descent.
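For reference, a minimal sketch of the two notions named above, using common formulations from the literature rather than the exact assumptions of the talk (the talk's precise conditions may differ):

% (L0, L1) smoothness, one standard form for twice-differentiable f:
% the local smoothness is allowed to grow with the gradient norm.
\[
  \bigl\|\nabla^2 f(x)\bigr\| \;\le\; L_0 + L_1\,\bigl\|\nabla f(x)\bigr\|
  \qquad \text{for all } x .
\]

% L2 max-margin (hard-margin) solution on linearly separable data {(x_i, y_i)}:
\[
  w^{\star} \;=\; \arg\min_{w}\ \|w\|_2^2
  \quad \text{subject to} \quad y_i\, w^{\top} x_i \ge 1 \ \text{ for all } i ,
\]
% and "converges to the L2 max-margin solution" means the normalized iterates
% satisfy  w_t / \|w_t\|_2 \to w^{\star} / \|w^{\star}\|_2 .
% (Display math assumes the amsmath package for \text.)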