Paper Title
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Paper Authors
Paper Abstract
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark, named GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
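The abstract compares OOD accuracy against ID accuracy to quantify robustness. A minimal sketch of that comparison, assuming robustness is reported as the relative accuracy drop from the ID test set to the OOD test set (the function names and toy predictions below are illustrative, not from the paper):

```python
# Hypothetical sketch: OOD robustness as the accuracy drop between an
# in-distribution (ID) test set and an out-of-distribution (OOD) test set.
# All names and data here are illustrative assumptions.

def accuracy(preds, labels):
    """Fraction of predictions matching the gold labels."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def ood_degradation(id_preds, id_labels, ood_preds, ood_labels):
    """Return ID accuracy, OOD accuracy, and the relative drop."""
    id_acc = accuracy(id_preds, id_labels)
    ood_acc = accuracy(ood_preds, ood_labels)
    return id_acc, ood_acc, (id_acc - ood_acc) / id_acc

# Toy binary-classification outputs (e.g., a sentiment task).
id_preds,  id_labels  = [1, 0, 1, 1], [1, 0, 1, 0]   # 3/4 correct on ID
ood_preds, ood_labels = [1, 1, 0, 0], [1, 0, 1, 0]   # 2/4 correct on OOD

id_acc, ood_acc, drop = ood_degradation(id_preds, id_labels,
                                        ood_preds, ood_labels)
print(id_acc, ood_acc, round(drop, 3))  # 0.75 0.5 0.333
```

A positive drop indicates the performance degradation the paper reports; averaging such drops across tasks and datasets yields a single robustness summary per model.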