Paper Title

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Authors

Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

Abstract

Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is selected so we can achieve compression and yet deliver high-quality results. In such a setting, distillation only happens for final predictions whereas the student could also benefit from teacher's supervision for internal components. Motivated by this, we studied the problem of distillation for intermediate layers. Since there might not be a one-to-one alignment between student and teacher layers, existing techniques skip some teacher layers and only distill from a subset of them. This shortcoming directly impacts quality, so we instead propose a combinatorial technique which relies on attention. Our model fuses teacher-side information and takes each layer's significance into consideration, then performs distillation between combined teacher layers and those of the student. Using our technique, we distilled a 12-layer BERT (Devlin et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE tasks (Wang et al. 2018). Experimental results show that our combinatorial approach is able to outperform other existing techniques.
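The combination step described in the abstract can be illustrated with a short, hypothetical PyTorch sketch: for each student layer, attention weights over all teacher layers are computed from dot-product scores, the teacher layers are fused with those weights, and a loss aligns the student layer with the fused representation. The function name `alp_kd_loss`, the use of per-layer pooled vectors, and the MSE objective are illustrative assumptions made here for clarity, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def alp_kd_loss(student_hiddens, teacher_hiddens):
    """Sketch of attention-based layer-projection distillation.

    student_hiddens: list of S tensors, each [batch, hidden]
        (e.g. a pooled vector from each student layer)
    teacher_hiddens: list of T tensors, each [batch, hidden]
    Returns a scalar loss between each student layer and an
    attention-weighted combination of all teacher layers.
    """
    teacher = torch.stack(teacher_hiddens, dim=1)           # [batch, T, hidden]
    loss = 0.0
    for s in student_hiddens:                                # s: [batch, hidden]
        # dot-product attention of the student layer over all teacher layers
        scores = torch.einsum("bh,bth->bt", s, teacher)      # [batch, T]
        alpha = F.softmax(scores, dim=-1)                    # per-layer significance
        fused = torch.einsum("bt,bth->bh", alpha, teacher)   # combined teacher layers
        loss = loss + F.mse_loss(s, fused)                   # assumed MSE alignment loss
    return loss / len(student_hiddens)

# Toy usage: a 4-layer student distilled from a 12-layer teacher, hidden size 768.
student = [torch.randn(8, 768) for _ in range(4)]
teacher = [torch.randn(8, 768) for _ in range(12)]
print(alp_kd_loss(student, teacher))
```

Because every teacher layer contributes to each fused target, no teacher layer is skipped, which is the shortcoming of fixed one-to-one layer mappings that the abstract points out.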
