Paper Title
Self-training Improves Pre-training for Natural Language Understanding
Paper Authors
Paper Abstract
Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.
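The retrieval step the abstract describes can be illustrated with a minimal sketch: build a task-specific query embedding from the labeled examples and rank the unlabeled sentence bank by cosine similarity to that query. The sketch below is an assumption-laden illustration, not the paper's exact recipe: `embed` is a toy hashed bag-of-words encoder standing in for the paper's learned sentence encoder, `retrieve_augmentation` is a hypothetical helper name, and the mean-pooled query is just one simple way to compute a query embedding from labeled data.

```python
import numpy as np


def embed(sentences, dim=256):
    """Stand-in encoder: hashed bag-of-words vectors.
    The paper uses a learned sentence encoder; any model that produces
    comparable fixed-size vectors could be substituted here."""
    vecs = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        for tok in sent.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs


def retrieve_augmentation(labeled_sentences, sentence_bank, top_k=3):
    """Rank unlabeled sentences by cosine similarity to a task-specific
    query embedding (here: the mean of the labeled-data embeddings)."""
    query = embed(labeled_sentences).mean(axis=0)   # task-specific query embedding
    bank_emb = embed(sentence_bank)                 # embeddings of the unlabeled bank
    # Cosine similarity between the query and every bank sentence.
    sims = bank_emb @ query / (
        np.linalg.norm(bank_emb, axis=1) * np.linalg.norm(query) + 1e-12
    )
    top = np.argsort(-sims)[:top_k]
    return [sentence_bank[i] for i in top]


if __name__ == "__main__":
    labeled = ["the movie was wonderful", "a thrilling and moving film"]
    bank = [
        "stocks fell sharply today",
        "a wonderful and moving picture",
        "the recipe calls for two eggs",
        "interest rates were left unchanged",
    ]
    # Retrieves the movie-related sentence first under this toy encoder.
    print(retrieve_augmentation(labeled, bank, top_k=2))
```

In the full method, the retrieved sentences would then be pseudo-labeled by a teacher model fine-tuned on the labeled data and used as additional training data for self-training; the sketch only covers the retrieval step.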