Paper Title

Differentiable Self-Adaptive Learning Rate

Paper Authors

Bozhou Chen, Hongzhi Wang, Chenmin Ba

Paper Abstract

Learning rate adaptation is a popular topic in machine learning. Gradient descent trains a neural network with a fixed learning rate, and learning rate adaptation was proposed to accelerate training by adjusting the step size during the training session. Well-known works include Momentum, Adam, and Hypergradient. Hypergradient is the most distinctive of these: it achieves adaptation by computing the derivative of the cost function with respect to the learning rate and applying gradient descent to the learning rate itself. However, Hypergradient is still not perfect. In practice, Hypergradient frequently fails to decrease the training loss after adapting the learning rate. Beyond that, evidence has been found that Hypergradient is not suitable for handling large datasets with minibatch training. Most unfortunately, Hypergradient always fails to achieve good accuracy on the validation dataset, even though it can reduce the training loss to a very small value. To address Hypergradient's problems, we propose a novel adaptation algorithm in which the learning rate is parameter-specific and internally structured. We conduct extensive experiments on multiple network models and datasets, comparing against various benchmark optimizers. The results show that our algorithm achieves faster and higher-quality convergence than those state-of-the-art optimizers.
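As background on the hypergradient rule the abstract refers to: for plain SGD, the derivative of the cost with respect to the learning rate reduces to minus the dot product of the current and previous parameter gradients, so the learning rate can itself be updated by gradient descent. Below is a minimal NumPy sketch of that baseline rule (it follows the standard SGD with hypergradient descent formulation, not the parameter-specific algorithm proposed in this paper; grad_fn, beta, and the toy objective are illustrative assumptions):

import numpy as np

def sgd_hd(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
    # Plain SGD whose scalar learning rate alpha is itself adapted by
    # gradient descent, using d(cost)/d(alpha) = -grad(theta_t) . grad(theta_{t-1}).
    prev_grad = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        # hypergradient step on the learning rate
        alpha += beta * float(np.dot(g, prev_grad))
        # ordinary parameter update with the adapted learning rate
        theta = theta - alpha * g
        prev_grad = g
    return theta, alpha

# Toy usage (hypothetical objective): minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
theta, alpha = sgd_hd(lambda x: x, np.array([3.0, -2.0]))
print(theta, alpha)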
