Paper Title

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Paper Authors

Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

Paper Abstract

Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git.
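A minimal NumPy sketch of the two mechanisms the abstract describes: a KERPLE-style logarithmic relative-position bias added to the attention logits, and a numerical check that a constant offset on the logits (the CPD-to-PD shift) is absorbed by the softmax normalization. The bias form -r1 * log(1 + r2 * |m - n|) with learnable r1, r2 > 0 follows the paper's logarithmic variant; the function names, toy dimensions, and the explicit `offset` argument are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kerple_log_bias(seq_len, r1, r2):
    # Logarithmic-variant relative positional bias:
    # bias[m, n] = -r1 * log(1 + r2 * |m - n|), with r1, r2 > 0 learnable per head.
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -r1 * np.log1p(r2 * dist)

def attention_weights(q, k, r1=1.0, r2=1.0, offset=0.0):
    # Scaled dot-product logits plus the relative-position bias; `offset`
    # stands in for the constant that turns the CPD kernel into a PD kernel.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + kerple_log_bias(len(q), r1, r2) + offset
    # Causal mask: position m attends only to positions n <= m.
    mask = np.triu(np.ones((len(q), len(q)), dtype=bool), k=1)
    logits = np.where(mask, -np.inf, logits)
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
w0 = attention_weights(q, k, offset=0.0)
w5 = attention_weights(q, k, offset=5.0)
# The constant offset cancels in the softmax, so the attention weights match:
assert np.allclose(w0, w5)
```

The assertion illustrates why the offset can be added "for free": softmax is invariant to adding the same constant to every logit in a row, so the PD-kernel correction never changes the attention distribution.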
