Paper Title
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper Authors
Paper Abstract
Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large-scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
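The core idea described in the abstract is to initialize a Mixture-of-Experts layer from a dense checkpoint by copying the dense feed-forward weights into every expert and adding a freshly initialized router. The snippet below is a minimal, illustrative sketch of that initialization step in plain Python/NumPy; the parameter names and shapes are assumptions for illustration, not the paper's actual codebase.

```python
import numpy as np

def upcycle_dense_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Sketch of sparse upcycling for one feed-forward block.

    dense_ffn: dict with 'w_in' (d_model x d_ff) and 'w_out' (d_ff x d_model)
               taken from a pretrained dense checkpoint (hypothetical layout).
    Returns a dict of MoE parameters where each expert starts as an exact
    copy of the dense weights, and the router is initialized from scratch
    because it has no dense counterpart.
    """
    rng = np.random.default_rng(seed)
    return {
        # Stack num_experts identical copies of the dense FFN weights.
        "experts_w_in": np.stack([dense_ffn["w_in"].copy() for _ in range(num_experts)]),
        "experts_w_out": np.stack([dense_ffn["w_out"].copy() for _ in range(num_experts)]),
        # Router weights (d_model x num_experts) are newly initialized.
        "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
    }

# Example usage with toy shapes: a dense FFN with d_model=8, d_ff=32,
# upcycled into a 4-expert MoE layer.
dense = {"w_in": np.zeros((8, 32)), "w_out": np.zeros((32, 8))}
moe = upcycle_dense_ffn(dense, num_experts=4, d_model=8)
```

The key design point this sketch illustrates is that all experts begin identical to the dense layer, so the upcycled model initially reproduces the dense model's behavior; subsequent sparse training lets the experts diverge while reusing the sunk pretraining cost.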