Paper Title
ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models
Paper Authors
Paper Abstract
Nowadays, pretrained language models (PLMs) have come to dominate the majority of NLP tasks. However, little research has been conducted on systematically evaluating the language abilities of PLMs. In this paper, we present a large-scale empirical study on general language ability evaluation of PLMs (ElitePLM). In our study, we design four evaluation dimensions, i.e., memory, comprehension, reasoning, and composition, to measure ten widely-used PLMs within five categories. Our empirical results demonstrate that: (1) PLMs with different training objectives and strategies excel at different ability tests; (2) fine-tuning PLMs on downstream tasks is usually sensitive to data size and distribution; (3) PLMs exhibit excellent transferability between similar tasks. Moreover, the prediction results of PLMs in our experiments are released as an open resource to enable deeper and more detailed analysis of the language abilities of PLMs. This paper can guide future work in selecting, applying, and designing PLMs for specific tasks. We have made all the details of our experiments publicly available at https://github.com/RUCAIBox/ElitePLM.