Paper Title

On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Authors

Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, Masashi Sugiyama

Abstract

Weight decay is a simple yet powerful regularization technique that has been widely used in training deep neural networks (DNNs). While weight decay has attracted much attention, previous studies failed to notice some overlooked pitfalls concerning the large gradient norms caused by weight decay. In this paper, we discover that weight decay can unfortunately lead to large gradient norms in the final phase of training (i.e., at the terminated solution), which often indicates bad convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training. Our experiments also confirm that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
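The abstract only states that SWD adjusts the weight decay strength according to the gradient norm; it does not give the exact rule. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, assuming a decoupled (AdamW-style) decay whose coefficient is scaled by the ratio of the current gradient norm to its running average. The class name ScheduledWeightDecay, the parameters base_wd and ema_beta, and the specific scaling rule are illustrative assumptions, not the paper's formula.

import torch

class ScheduledWeightDecay:
    """Hypothetical gradient-norm-based weight decay scheduler (illustrative sketch only)."""

    def __init__(self, base_wd=1e-2, ema_beta=0.98, eps=1e-12):
        self.base_wd = base_wd      # base weight decay coefficient
        self.ema_beta = ema_beta    # smoothing factor for the gradient-norm average
        self.eps = eps
        self.avg_grad_norm = None   # running average of the global gradient norm

    def step(self, params, lr):
        params = list(params)
        # Global gradient norm over all parameters with gradients.
        grads = [p.grad for p in params if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))

        # Exponential moving average of the gradient norm.
        if self.avg_grad_norm is None:
            self.avg_grad_norm = grad_norm
        else:
            self.avg_grad_norm = (self.ema_beta * self.avg_grad_norm
                                  + (1 - self.ema_beta) * grad_norm)

        # Dynamically adjusted decay: stronger when the current gradient
        # norm exceeds its running average (assumed scaling rule).
        wd_t = self.base_wd * grad_norm / (self.avg_grad_norm + self.eps)

        # Decoupled weight decay applied directly to the parameters,
        # in the spirit of AdamW, before the optimizer step.
        with torch.no_grad():
            for p in params:
                if p.grad is not None:
                    p.mul_(1.0 - lr * wd_t)
        return wd_t

In a training loop, one would call scheduler.step(model.parameters(), lr) after backward() and before optimizer.step(), with the optimizer (e.g., Adam) configured with weight_decay=0 so that the decay is applied only by the scheduler.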
