Paper Title

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Paper Authors

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

Paper Abstract

Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.
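To make the "approximate second-order information" idea concrete, below is a minimal illustrative sketch of an OBS-style (Optimal Brain Surgeon) saliency criterion, assuming a diagonal empirical-Fisher approximation of the Hessian. This is not the paper's implementation: oBERT uses a block-wise empirical Fisher and also applies the corresponding OBS weight update after pruning, and all function names and shapes here are hypothetical.

```python
# Illustrative sketch (not the oBERT implementation): OBS-style saliency scoring
# with a *diagonal* empirical-Fisher approximation of the Hessian, followed by
# magnitude-of-saliency pruning. oBERT itself uses block-wise Fisher information
# and additionally updates the remaining weights.
import torch


def diagonal_fisher(per_sample_grads: torch.Tensor) -> torch.Tensor:
    """Approximate diag(H) by the empirical Fisher: mean of squared per-sample
    gradients. per_sample_grads has shape (N, D) for one flattened weight tensor."""
    return per_sample_grads.pow(2).mean(dim=0)  # shape (D,)


def obs_saliency(weights: torch.Tensor, fisher_diag: torch.Tensor,
                 damp: float = 1e-7) -> torch.Tensor:
    """OBS saliency rho_i = w_i^2 / (2 * [H^{-1}]_{ii}).
    With a diagonal Hessian approximation, [H^{-1}]_{ii} = 1 / (F_ii + damp)."""
    return 0.5 * weights.pow(2) * (fisher_diag + damp)


def prune_by_saliency(weights: torch.Tensor, fisher_diag: torch.Tensor,
                      sparsity: float) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the lowest saliency."""
    scores = obs_saliency(weights, fisher_diag)
    k = int(sparsity * weights.numel())
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).to(weights.dtype)
    return weights * mask


if __name__ == "__main__":
    # Toy usage with random data (hypothetical shapes).
    torch.manual_seed(0)
    w = torch.randn(1024)           # flattened weight tensor of one layer
    grads = torch.randn(32, 1024)   # 32 per-sample gradients for that tensor
    w_sparse = prune_by_saliency(w, diagonal_fisher(grads), sparsity=0.9)
    print(f"nonzeros: {(w_sparse != 0).sum().item()} / {w_sparse.numel()}")
```

The design choice the sketch highlights is that, unlike plain magnitude pruning, the score weights each parameter by curvature information estimated from gradients, so small weights that the loss is sensitive to are kept.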
