Paper Title
A Solvable Model of Neural Scaling Laws
Paper Authors
Alexander Maloney, Daniel A. Roberts, James Sully
Paper Abstract
Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss and how the finite extent of the data's spectral power law causes the model's performance to plateau.
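To make the abstract's setup concrete, here is a minimal numerical sketch of the kind of model it describes: Gaussian data whose covariance eigenvalues are assumed to follow a power law, a linear random feature map, and (nearly) ridgeless regression on a noiseless linear target. This is an illustrative assumption, not the paper's exact construction; the dimensions, the exponent `alpha`, and the helper names `sample_data` and `test_loss` are all hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (all values here are assumptions for the sketch).
M = 1000        # latent data dimension
alpha = 1.0     # exponent controlling the assumed power-law covariance spectrum
lams = np.arange(1, M + 1, dtype=float) ** (-(1.0 + alpha))  # eigenvalues ~ i^-(1+alpha)
w_star = rng.normal(size=M)                                  # ground-truth linear task vector

def sample_data(T):
    """Draw T Gaussian samples whose covariance has the power-law spectrum `lams`."""
    X = rng.normal(size=(T, M)) * np.sqrt(lams)  # independent coordinates, Var = lams
    y = X @ w_star                               # noiseless linear target
    return X, y

def test_loss(N, T, n_test=2000, ridge=1e-8):
    """Fit (nearly) ridgeless regression on N random features of T training samples."""
    W = rng.normal(size=(M, N)) / np.sqrt(M)     # random feature map x -> x @ W
    X_tr, y_tr = sample_data(T)
    X_te, y_te = sample_data(n_test)
    F_tr, F_te = X_tr @ W, X_te @ W
    beta = np.linalg.solve(F_tr.T @ F_tr + ridge * np.eye(N), F_tr.T @ y_tr)
    return np.mean((F_te @ beta - y_te) ** 2)

# Sweep the number of random features N at a large fixed training set size T:
# the test loss should fall off roughly as a power law in N until it is
# bottlenecked, mirroring the parameter-scaling behaviour described above.
T = 2000
for N in (50, 100, 200, 400, 800):
    print(f"N={N:4d}  test loss ~ {test_loss(N, T):.4f}")
```

In the same spirit, truncating the spectrum in this toy setup (e.g., flattening `lams` beyond some index so the power law has finite extent) should make the swept losses level off, which is the plateau effect the abstract attributes to the finite extent of the data's spectral power law.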