Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic

Paper title

Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic

Authors

Jón Friðrik Daðason, David Erik Mollberg, Hrafn Loftsson, Kristín Bjarnadóttir

Abstract

In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
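The idea of deriving a constituent structure from repeated two-way splits, and then choosing a task-specific split, can be illustrated with a minimal Python sketch. This is not the Kvistur implementation: the BiLSTM is replaced by a hypothetical lookup table (`TOY_SPLITS`), the function names are invented for illustration, and the sketch assumes that the head of a compound is its right-hand part.

```python
from typing import Callable, List, Optional, Set, Tuple, Union

# A constituent tree is either a leaf (a string) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]
# A binary splitter returns (modifier, head), or None for a non-compound.
BinarySplit = Callable[[str], Optional[Tuple[str, str]]]

# Hypothetical stand-in for the trained model: a tiny hand-written table.
TOY_SPLITS = {
    "borðstofuborð": ("borðstofu", "borð"),
    "borðstofu": ("borð", "stofu"),
}


def toy_split(word: str) -> Optional[Tuple[str, str]]:
    """Return the two parts of a compound, or None if it is not split."""
    return TOY_SPLITS.get(word)


def build_tree(word: str, split: BinarySplit) -> Tree:
    """Apply the binary splitter recursively to get the constituent tree."""
    parts = split(word)
    if parts is None:
        return word
    modifier, head = parts
    return (build_tree(modifier, split), build_tree(head, split))


def full_split(tree: Tree) -> List[str]:
    """Flatten the tree into its leaves, e.g. for subword tokenization."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return full_split(left) + full_split(right)


def largest_known_head(
    word: str, split: BinarySplit, vocabulary: Set[str]
) -> Optional[str]:
    """Strip modifiers until the remaining head is in the vocabulary.

    Assumes the head is the right-hand part of each binary split.
    Returns None if no known head is found.
    """
    while word not in vocabulary:
        parts = split(word)
        if parts is None:
            return None
        word = parts[1]
    return word


tree = build_tree("borðstofuborð", toy_split)
print(tree)
print(full_split(tree))
print(largest_known_head("borðstofuborð", toy_split, {"borð"}))
```

In this sketch the same tree serves both use cases mentioned in the abstract: flattening it gives a full split, while walking down the right-hand side stops at the first head that a downstream tool already knows.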
