Paper Title
On the Sentence Embeddings from Pre-trained Language Models
Paper Authors
Paper Abstract
Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to poorly capture the semantic meaning of sentences. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective. Experimental results show that our proposed BERT-flow method obtains significant performance gains over state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks. The code is available at https://github.com/bohanli/BERT-flow.
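The abstract describes keeping BERT frozen and fitting a shallow normalizing flow on top of its sentence embeddings with a maximum-likelihood (unsupervised) objective, so that the embedding distribution becomes an isotropic Gaussian. The sketch below illustrates that idea with a few RealNVP-style affine coupling layers in PyTorch; it is a minimal illustration under assumed choices (model name, pooling strategy, layer count, hyperparameters, toy corpus), not the authors' implementation, which lives at https://github.com/bohanli/BERT-flow.

```python
# Minimal sketch (not the authors' code): freeze BERT, mean-pool token vectors
# into sentence embeddings, and fit a small RealNVP-style normalizing flow that
# maps those embeddings to a standard Gaussian by maximizing exact log-likelihood.
# Model name, pooling, layer count, and hyperparameters are illustrative assumptions.
import math
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class AffineCoupling(nn.Module):
    """One coupling layer: one half of the vector is scaled/shifted by an MLP
    conditioned on the other half, which keeps the Jacobian determinant cheap."""

    def __init__(self, dim, hidden=256, flip=False):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - half)),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        s, t = self.net(a).chunk(2, dim=-1)
        s = torch.tanh(s)                        # bounded log-scales for stability
        y = b * torch.exp(s) + t
        out = torch.cat([y, a] if self.flip else [a, y], dim=-1)
        return out, s.sum(dim=-1)                # log|det Jacobian| of this layer


class Flow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )

    def forward(self, x):
        log_det = torch.zeros(x.size(0), device=x.device)
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
        return x, log_det

    def nll(self, x):
        # negative log-likelihood of x under a N(0, I) base distribution
        z, log_det = self(x)
        log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
        return -(log_pz + log_det).mean()


@torch.no_grad()
def sentence_embeddings(sentences, tokenizer, bert):
    # mask-aware mean pooling of the last hidden layer (the paper also considers
    # averaging the first and last layers; last-layer pooling is used for brevity)
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state                  # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()  # BERT stays frozen
flow = Flow(dim=bert.config.hidden_size)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

corpus = [  # unsupervised: any raw sentences work, no similarity labels needed
    "A man is playing a guitar.",
    "Someone is playing an instrument.",
    "The weather is nice today.",
    "It is sunny outside.",
]
embeddings = sentence_embeddings(corpus, tokenizer, bert)

for step in range(200):
    optimizer.zero_grad()
    loss = flow.nll(embeddings)
    loss.backward()
    optimizer.step()

# The flow outputs, not the raw pooled embeddings, are then compared with
# cosine or L2 distance for semantic similarity.
z, _ = flow(embeddings)
```

After training, similarity between sentences would be computed on the flow outputs z rather than on the raw pooled BERT embeddings, which is the sense in which the anisotropic embedding space is mapped to a smooth, isotropic Gaussian one.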