Paper Title
Incorporating Context into Subword Vocabularies
Paper Authors
Paper Abstract
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
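The abstract only gestures at the mechanism, so as a rough illustration of what "baking in the contextualized signal at the vocabulary creation phase" could look like, here is a minimal Python sketch: start from a large seed vocabulary and greedily ablate the subword whose removal least hurts a crude co-occurrence objective. Every name here (`encode`, `contextual_score`, `prune_vocab`) and the PMI-style score itself are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of context-aware vocabulary pruning in the spirit of
# the abstract. Assumptions: greedy longest-match segmentation, a PMI-style
# co-occurrence score standing in for the skip-gram signal, and greedy
# token ablation down to a target vocabulary size.
from collections import Counter
from math import log

def encode(word, vocab):
    """Greedy longest-match segmentation of a word with the given vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:  # fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

def contextual_score(corpus, vocab, window=1):
    """PMI-weighted co-occurrence log-likelihood over token neighborhoods --
    a crude stand-in for the contextualized signal the abstract alludes to."""
    pair_counts, tok_counts = Counter(), Counter()
    for word in corpus:
        toks = encode(word, vocab)
        tok_counts.update(toks)
        for i, t in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    pair_counts[(t, toks[j])] += 1
    total = sum(tok_counts.values()) or 1
    return sum(c * log(c * total / (tok_counts[a] * tok_counts[b]))
               for (a, b), c in pair_counts.items())

def prune_vocab(corpus, seed_vocab, target_size):
    """Greedily drop the token whose removal costs the least contextual score."""
    vocab = set(seed_vocab)
    chars = {c for w in corpus for c in w}  # keep single characters as a floor
    while len(vocab) > target_size:
        candidates = vocab - chars
        if not candidates:
            break
        drop = max(candidates,
                   key=lambda t: contextual_score(corpus, vocab - {t}))
        vocab.remove(drop)
    return vocab

corpus = ["unhappiness", "happiness", "unhappy", "happily"] * 5
seed = {"un", "happi", "happy", "ness", "ly", "hap", "pin", "ess"}
print(sorted(prune_vocab(corpus, seed, target_size=6)))
```

The contrast with frequency-based training is in the objective: a frequency-driven tokenizer would keep whichever subwords cover the corpus most compactly, while this sketch keeps the subwords whose neighboring tokens remain most predictable, which is one way of reading the abstract's claim about keeping token contexts cohesive.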