Paper Title
Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
Paper Authors
Paper Abstract
Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the $k$-NN estimator, and so on. We consider a teacher-student regression model and eventually show that deep learning can outperform any linear estimator in the sense of the minimax optimal rate, especially in a high-dimensional setting. The obtained excess risk bounds are so-called fast learning rates, which are faster than the $O(1/\sqrt{n})$ rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model, and the noisy gradient descent used for neural network training provably reaches a near-global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it shows a preferable generalization performance that dominates linear estimators.
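To make the training scheme described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of noisy gradient descent, i.e. a discretized gradient Langevin dynamics update, with ridge regularization on a mildly overparameterized two-layer student network fit to data generated by a teacher network. All concrete choices here are illustrative assumptions: the tanh activation, the widths `m_teacher` and `m_student`, the step size `eta`, the ridge weight `lam`, and the inverse temperature `beta`.

```python
# Sketch: teacher-student regression with a mildly overparameterized two-layer
# student network trained by noisy gradient descent (gradient Langevin dynamics)
# plus ridge regularization. Hyperparameters are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

d, m_teacher, m_student, n = 10, 5, 50, 500   # input dim, teacher/student widths, sample size

# Teacher network f*(x) = a* . tanh(W* x); the student is wider (m_student > m_teacher).
W_t = rng.normal(size=(m_teacher, d))
a_t = rng.normal(size=m_teacher)

X = rng.normal(size=(n, d))
y = np.tanh(X @ W_t.T) @ a_t + 0.1 * rng.normal(size=n)   # noisy teacher outputs

# Student parameters.
W = rng.normal(size=(m_student, d)) * 0.1
a = rng.normal(size=m_student) * 0.1

eta, lam, beta, steps = 1e-2, 1e-3, 1e4, 5000   # step size, ridge weight, inverse temperature

for _ in range(steps):
    H = np.tanh(X @ W.T)                 # hidden activations, shape (n, m_student)
    resid = H @ a - y                    # residuals, shape (n,)

    # Gradients of the ridge-regularized empirical squared loss.
    grad_a = H.T @ resid / n + lam * a
    grad_W = ((resid[:, None] * (1 - H**2) * a).T @ X) / n + lam * W

    # Noisy gradient descent: gradient step plus Gaussian noise of scale
    # sqrt(2 * eta / beta), i.e. a discretized Langevin dynamics update.
    a -= eta * grad_a + np.sqrt(2 * eta / beta) * rng.normal(size=a.shape)
    W -= eta * grad_W + np.sqrt(2 * eta / beta) * rng.normal(size=W.shape)

train_mse = np.mean((np.tanh(X @ W.T) @ a - y) ** 2)
print(f"training MSE after noisy gradient descent: {train_mse:.4f}")
```

The injected Gaussian noise lets the iterates escape poor local minima of the non-convex loss landscape, which is the mechanism the abstract appeals to when claiming that the training dynamics provably reach a near-global optimal solution.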