Paper Title
Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum
Paper Authors
Paper Abstract
Momentum plays a crucial role in stochastic gradient-based optimization algorithms for accelerating or improving the training of deep neural networks (DNNs). In deep learning practice, the momentum is usually weighted by a well-calibrated constant. However, tuning this momentum hyperparameter can be a significant computational burden. In this paper, we propose a novel \emph{adaptive momentum} for improving DNN training; this adaptive momentum, which requires no momentum-related hyperparameter, is motivated by the nonlinear conjugate gradient (NCG) method. Stochastic gradient descent (SGD) with this new adaptive momentum eliminates the need for momentum hyperparameter calibration, allows a significantly larger learning rate, accelerates DNN training, and improves the final accuracy and robustness of the trained DNNs. For instance, SGD with this adaptive momentum reduces the classification error of ResNet110 on CIFAR10 from $5.25\%$ to $4.64\%$ and on CIFAR100 from $23.75\%$ to $20.03\%$. Furthermore, SGD with the new adaptive momentum also benefits adversarial training and improves the adversarial robustness of the trained DNNs.
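To make the idea concrete, below is a minimal sketch (not the authors' released code) of SGD in which the momentum weight is recomputed at every step from consecutive stochastic gradients, in the spirit of nonlinear conjugate gradient. The use of a Polak-Ribière-style coefficient clipped at zero is an illustrative assumption; the paper's exact update rule may differ. The class name `AdaptiveMomentumSGD` is hypothetical.

```python
# A minimal sketch, assuming a Polak-Ribiere-style coefficient plays the role
# of the adaptive momentum weight; the paper's exact rule may differ.
import torch
from torch.optim import Optimizer


class AdaptiveMomentumSGD(Optimizer):
    """SGD whose momentum weight is recomputed each step from consecutive
    stochastic gradients, NCG-style, so no momentum hyperparameter is tuned."""

    def __init__(self, params, lr=0.1, eps=1e-12):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "prev_grad" not in state:
                    # First step: plain gradient descent, no momentum term yet.
                    state["direction"] = -g.clone()
                else:
                    g_prev = state["prev_grad"]
                    # Polak-Ribiere-style coefficient, clipped at 0, acting as
                    # the per-step adaptive momentum weight (an assumption).
                    beta = torch.dot(g.flatten(), (g - g_prev).flatten())
                    beta = beta / (torch.dot(g_prev.flatten(), g_prev.flatten()) + eps)
                    beta = torch.clamp(beta, min=0.0)
                    state["direction"] = -g + beta * state["direction"]
                state["prev_grad"] = g.clone()
                # Update the parameters along the conjugate-gradient-style direction.
                p.add_(state["direction"], alpha=lr)
        return loss
```

Usage mirrors any PyTorch optimizer: construct it as `opt = AdaptiveMomentumSGD(model.parameters(), lr=0.1)`, then call `loss.backward()` and `opt.step()` inside the training loop; no momentum constant is ever specified.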