Paper Title
PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval
Paper Authors
Paper Abstract
Recently pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with higher likelihood is deemed as more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP can achieve exciting performance under both the zero- and low-resource IR settings. The code and pre-trained models are available at https://github.com/Albert-Ma/PROP.
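To make the ROP objective described above more concrete, the following is a minimal sketch, assuming a simple unigram document language model for sampling the two word sets and a Bradley-Terry-style log-sigmoid loss for the pairwise preference, combined with the MLM loss. All function names, parameters, and modeling choices here are illustrative assumptions for exposition, not the authors' released implementation.

```python
import collections
import math
import random

import torch

# Illustrative sketch of the two pieces described in the abstract (not the
# authors' code): (1) sample a pair of word sets from a unigram document
# language model and rank them by likelihood, and (2) a pairwise ROP loss
# trained jointly with the MLM objective.

def sample_word_set(doc_tokens, set_size, rng=random):
    """Sample a word set from the document's unigram language model."""
    counts = collections.Counter(doc_tokens)
    words = list(counts)
    weights = [counts[w] for w in words]
    sampled = rng.choices(words, weights=weights, k=set_size)
    # Log-likelihood of the sampled set under the same unigram model.
    total = sum(weights)
    log_lik = sum(math.log(counts[w] / total) for w in sampled)
    return sampled, log_lik

def make_rop_pair(doc_tokens, set_size=5):
    """Sample two word sets; the higher-likelihood one is treated as more
    representative of the document (returned first)."""
    set_a, ll_a = sample_word_set(doc_tokens, set_size)
    set_b, ll_b = sample_word_set(doc_tokens, set_size)
    return (set_a, set_b) if ll_a >= ll_b else (set_b, set_a)

def prop_loss(score_rep, score_nonrep, mlm_loss):
    """Joint objective: pairwise ROP preference loss plus the MLM loss.

    score_rep / score_nonrep: scalar scores the Transformer assigns to the
    (representative word set, document) and (less representative word set,
    document) inputs; mlm_loss: the standard masked language model loss.
    """
    rop_loss = -torch.nn.functional.logsigmoid(score_rep - score_nonrep).mean()
    return rop_loss + mlm_loss
```

In this sketch the pairwise term pushes the model to score the higher-likelihood word set above the lower-likelihood one for the same document, which mirrors the query likelihood intuition stated in the abstract; the actual paper's sampling and scoring details may differ.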