Paper Title
TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval
Paper Authors
Paper Abstract
Robust test collections are crucial for Information Retrieval research. Recently, there has been growing interest in evaluating retrieval systems on domain-specific retrieval tasks; however, these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains click log data from the Trip search engine and includes two click-based test sets. However, the clicks are biased towards the retrieval model used, which remains unknown, and a previous study shows that the test sets have low judgement coverage for the top-10 results of lexical and neural retrieval models. In this paper, we present TripJudge, a novel relevance judgement test collection for TripClick health retrieval. We collect relevance judgements in an annotation campaign and ensure the quality and reusability of TripJudge through a variety of ranking methods for pool creation, through multiple judgements per query-document pair, and through at least moderate inter-annotator agreement. We compare system evaluation with TripJudge and TripClick and find that click-based and judgement-based evaluation can lead to substantially different system rankings.
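
The abstract mentions creating the judgement pool from a variety of ranking methods. A minimal sketch of the classic Cranfield-style top-k pooling step, assuming the system runs are already loaded as per-query ranked lists (the run names, query IDs, and document IDs below are hypothetical, not taken from the paper):

```python
def build_pool(runs, depth=10):
    """Union of the top-`depth` documents from several system runs,
    per query -- the classic Cranfield-style pooling step.

    runs: dict mapping run name -> {query_id: ranked list of doc_ids}
    Returns {query_id: set of doc_ids to send to annotators}.
    """
    pool = {}
    for ranking in runs.values():
        for qid, docs in ranking.items():
            pool.setdefault(qid, set()).update(docs[:depth])
    return pool

# Hypothetical top-ranked documents from a lexical and a neural ranker.
runs = {
    "bm25":   {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]},
    "neural": {"q1": ["d3", "d5", "d1"], "q2": ["d4", "d7"]},
}
print(build_pool(runs, depth=2))
# e.g. {'q1': {'d1', 'd2', 'd3', 'd5'}, 'q2': {'d4', 'd7', 'd9'}}
# (set order may vary)
```

Pooling over several diverse rankers, rather than a single system, is what makes the resulting judgements reusable for evaluating systems that did not contribute to the pool.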
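The paper reports at least moderate inter-annotator agreement; on the widely used Landis & Koch scale, "moderate" means a chance-corrected agreement between 0.41 and 0.60. The abstract does not name the coefficient, so the following sketch computes Fleiss' kappa, one standard choice when each item is labelled by more than two annotators, on made-up three-level judgements:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items each rated by the same number of
    annotators on a fixed set of categories.

    ratings: list of lists, e.g. [[2, 2, 1], ...], where each inner
             list holds the labels assigned to one query-document pair.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    p_bar = 0.0  # mean observed per-item agreement
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_bar /= n_items

    # Expected agreement by chance, from overall category proportions.
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical 3-way judgements (0 = not relevant, 1 = partial, 2 = relevant)
# for three query-document pairs, each labelled by three annotators.
sample = [[2, 2, 1], [0, 0, 0], [2, 2, 2]]
print(f"Fleiss' kappa: {fleiss_kappa(sample):.2f}")
# ~0.61 here; >= 0.41 counts as 'moderate' on the Landis & Koch scale.
```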
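Finally, differences in system rankings between two test collections are commonly quantified with a rank correlation such as Kendall's tau, where 1 means identical orderings and values near 0 mean the two evaluations order systems very differently. The abstract does not state the measure used, so this sketch, with invented system names and effectiveness scores, is illustrative only:

```python
from scipy.stats import kendalltau

# Hypothetical nDCG@10 scores for the same systems evaluated against
# the click-based TripClick qrels and the human-judged TripJudge qrels.
systems = ["BM25", "BERT-DOT", "BERT-CAT", "ColBERT", "uniCOIL"]
tripclick_scores = [0.21, 0.28, 0.31, 0.30, 0.26]  # made-up values
tripjudge_scores = [0.35, 0.41, 0.52, 0.55, 0.44]  # made-up values

# Order systems from best to worst under each evaluation.
def order(scores):
    return [systems[i] for i in sorted(range(len(scores)),
                                       key=lambda i: -scores[i])]

print("TripClick order:", order(tripclick_scores))
print("TripJudge order:", order(tripjudge_scores))

# Rank correlation between the two evaluations.
tau, p_value = kendalltau(tripclick_scores, tripjudge_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```

With these made-up scores the two collections agree on the worst system but swap the top two, giving tau = 0.6, the kind of partial disagreement the abstract's conclusion describes.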