Paper Title
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
Paper Authors
Paper Abstract
Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to in-context learn-perform a task is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and number of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, reinforcing arguments by Olsson et al. (arXiv:2209.11895) regarding induction head generality to more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights that indicate large language models may be under-trained for in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.
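The primitive induction operations named in the abstract, prefix matching and copying, can be probed directly from attention patterns. Below is a minimal sketch (not the authors' code) of a prefix-matching score: a random token sequence is repeated twice, and each head is scored by how strongly, at positions in the second copy, it attends to the token that followed the same token's first occurrence. The model name facebook/opt-125m, the sequence length, and the top-k value are illustrative stand-ins; the paper studies OPT-66B.

import torch
from transformers import AutoModelForCausalLM

# Small OPT checkpoint as a stand-in for OPT-66B (assumption: any OPT-family
# checkpoint exposes per-head attention maps via output_attentions=True).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

seq_len = 64  # length of the random prefix; the full input is this prefix repeated twice
vocab_size = model.config.vocab_size
prefix = torch.randint(low=4, high=vocab_size, size=(1, seq_len))  # skip low special-token ids
input_ids = torch.cat([prefix, prefix], dim=1)

with torch.no_grad():
    out = model(input_ids, output_attentions=True)
# out.attentions: one tensor per layer, shape (batch, n_heads, 2*seq_len, 2*seq_len)

scores = []
for layer_attn in out.attentions:
    attn = layer_attn[0]                         # (n_heads, 2*seq_len, 2*seq_len)
    query = torch.arange(seq_len, 2 * seq_len)   # positions in the second copy
    target = query - seq_len + 1                 # token that followed the first occurrence
    scores.append(attn[:, query, target].mean(dim=-1))  # mean attention mass = prefix-matching score

scores = torch.stack(scores)                     # (n_layers, n_heads)
top = torch.topk(scores.flatten(), k=5)          # heads with the strongest induction-like pattern
print("top prefix-matching scores:", [round(v, 3) for v in top.values.tolist()])

Copying would be scored analogously from how much a head's output raises the probability of the attended-to token; heads that score highly on both operations are the induction heads the abstract refers to.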