Paper Title

Knowledge Distillation of Transformer-based Language Models Revisited

Paper Authors

Chengqiang Lu, Jianwei Zhang, Yunfei Chu, Zhengyu Chen, Jingren Zhou, Fei Wu, Haiqing Chen, Hongxia Yang

Paper Abstract

In the past few years, transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress these models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments consuming over 23,000 GPU hours yield a comprehensive analysis from the perspectives of knowledge types, matching strategies, width-depth trade-offs, initialization, model size, etc. Our empirical results shed light on distillation in pre-trained language models and achieve significant improvements over the previous state of the art (SOTA). Finally, we provide a best-practice guideline for KD in transformer-based models.
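
To make the "knowledge types" and "matching strategies" mentioned in the abstract concrete, here is a minimal, generic sketch of a combined distillation loss in PyTorch, pairing a softened-logit KL term with a hidden-state matching term. This is an illustration under assumptions, not the paper's actual framework: the function name, the temperature/alpha weighting, and the assumption that teacher and student hidden states share the same dimension are all hypothetical choices.

```python
# Minimal sketch of two common knowledge types in transformer KD:
# (1) soft logits from the teacher, (2) intermediate hidden states.
# Generic illustration only; not the paper's exact framework.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
            temperature=4.0, alpha=0.5):
    # Response-based knowledge: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Feature-based knowledge: match selected hidden states. This is one
    # matching strategy among several; it assumes the student's hidden states
    # have already been projected to the teacher's dimension.
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * soft_loss + (1 - alpha) * hidden_loss
```

In practice, which teacher layers are matched, how the terms are weighted, and how the student is initialized are exactly the design axes the paper's experiments compare.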
