Paper Title
A Globally Convergent Gradient-based Bilevel Hyperparameter Optimization Method
Paper Authors
Abstract
Hyperparameter optimization in machine learning is often carried out with naive techniques that yield only an approximate set of hyperparameters. Although techniques such as Bayesian optimization search a given hyperparameter domain intelligently, they do not guarantee an optimal solution. A major drawback of most of these approaches is that the search domain grows exponentially with the number of hyperparameters, which increases the computational cost and makes the approaches slow. The hyperparameter optimization problem is inherently a bilevel optimization task, and some studies have attempted bilevel solution methodologies for it. However, these studies assume a unique set of model weights that minimizes the training loss, an assumption that is generally violated by deep learning architectures. This paper discusses a gradient-based bilevel method for the hyperparameter optimization problem that addresses these drawbacks. The proposed method can handle continuous hyperparameters; in our experiments, we chose the regularization hyperparameter. The method is guaranteed to converge to the set of optimal hyperparameters, which this study proves theoretically. The idea is based on approximating the lower-level optimal value function using Gaussian process regression. As a result, the bilevel problem is reduced to a single-level constrained optimization task that is solved using the augmented Lagrangian method. We have performed an extensive computational study on the MNIST and CIFAR-10 datasets with multi-layer perceptron and LeNet architectures that confirms the efficiency of the proposed method. A comparative study against grid search, random search, Bayesian optimization, and the Hyperband method on various hyperparameter problems shows that the proposed algorithm converges with lower computation and leads to models that generalize better on the testing set.
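The reduction described in the abstract can be sketched in equations. The notation below is an assumption made for illustration and is not taken from the paper: $\lambda$ denotes the hyperparameters, $w$ the model weights, $F$ the upper-level (validation) loss, $f$ the lower-level (training) loss, $\hat{\varphi}$ a Gaussian process surrogate of the lower-level optimal value function, $\mu$ a multiplier, and $\rho > 0$ a penalty parameter. The augmented Lagrangian shown is the standard textbook form for a single inequality constraint; the paper's exact formulation may differ.

```latex
% Bilevel form (assumed notation): \lambda = hyperparameters, w = model weights
\begin{align*}
\min_{\lambda,\, w} \quad & F(w, \lambda) \\
\text{s.t.} \quad & w \in \operatorname*{arg\,min}_{w'} f(w', \lambda)
\end{align*}

% Lower-level optimal value function
\[
\varphi(\lambda) = \min_{w'} f(w', \lambda)
\]

% Single-level reformulation, with \varphi replaced by a surrogate \hat{\varphi}
% (for example, the posterior mean of a Gaussian process fitted to sampled
% pairs (\lambda_i, \varphi(\lambda_i)))
\begin{align*}
\min_{\lambda,\, w} \quad & F(w, \lambda) \\
\text{s.t.} \quad & g(w, \lambda) := f(w, \lambda) - \hat{\varphi}(\lambda) \le 0
\end{align*}

% Standard augmented Lagrangian for the inequality constraint g \le 0
\[
\mathcal{L}_{\rho}(w, \lambda; \mu)
  = F(w, \lambda)
  + \frac{1}{2\rho} \left( \max\{0,\ \mu + \rho\, g(w, \lambda)\}^{2} - \mu^{2} \right)
\]

% Multiplier update after each (approximate) minimization over (w, \lambda)
\[
\mu \leftarrow \max\{0,\ \mu + \rho\, g(w, \lambda)\}
\]
```

With the exact value function, the constraint $f(w, \lambda) \le \varphi(\lambda)$ is satisfied by every minimizer of the training loss, so this form does not require the minimizer to be unique, which is consistent with the abstract's point about deep learning architectures.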