Paper Title
The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages
Paper Authors
Paper Abstract
Data availability and quality are major challenges in natural language processing for low-resourced languages. In particular, there is significantly less data available than for higher-resourced languages. This data is also often of low quality, rife with errors, invalid text, or incorrect annotations. Many prior works focus on dealing with these problems, either by generating synthetic data or by filtering out low-quality parts of datasets. We instead investigate these factors more deeply, by systematically measuring the effect of data quantity and quality on the performance of pre-trained language models in a low-resourced setting. Our results show that having fewer completely labelled sentences is significantly better than having more sentences with missing labels, and that models can perform remarkably well with only 10% of the training data. Importantly, these results are consistent across ten low-resourced languages, English, and four pre-trained models.
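The two data-corruption regimes the abstract contrasts — sentences with missing entity labels versus a smaller set of fully labelled sentences — can be sketched as follows. This is a minimal illustration only; the function names and the CoNLL-style `(tokens, tags)` representation are our own assumptions, not the paper's actual code:

```python
import random

def drop_labels(sentences, frac_missing, seed=0):
    """Corruption regime 1: keep every sentence, but replace a random
    fraction of entity tags with 'O', simulating missing annotations."""
    rng = random.Random(seed)
    corrupted = []
    for tokens, tags in sentences:
        new_tags = [
            "O" if tag != "O" and rng.random() < frac_missing else tag
            for tag in tags
        ]
        corrupted.append((tokens, new_tags))
    return corrupted

def subsample_sentences(sentences, frac_keep, seed=0):
    """Corruption regime 2: keep only a random fraction of sentences,
    each of which remains completely labelled."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(sentences) * frac_keep))
    return rng.sample(sentences, n_keep)
```

Under this framing, the abstract's finding is that training on `subsample_sentences(data, 0.1)` can outperform training on `drop_labels(data, ...)` at a high missing-label rate, even though the latter retains far more sentences.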