Paper title

Assessing top-$k$ preferences

Authors

Charles L. A. Clarke, Alexandra Vtyurina, Mark D. Smucker

Abstract

Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this paper, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named "compatibility". This new measure has its most striking impact when comparing modern neural rankers, where it is able to recognize significant improvements in quality that would otherwise be missed by NDCG.
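The idea of scoring a ranker by its similarity to a preferred top-$k$ ordering can be illustrated with a small sketch. The abstract does not name the rank similarity measure, so this example assumes truncated rank-biased overlap (RBO) with a persistence parameter `p`, normalized by the score the preferred ordering obtains against itself. The function names, the default `p=0.95`, and the document identifiers are illustrative assumptions, not the authors' implementation.

```python
def rank_biased_overlap(run, ideal, p=0.95, depth=None):
    """Truncated RBO between two tie-free rankings (assumed similarity measure)."""
    if depth is None:
        depth = max(len(run), len(ideal))
    seen_run, seen_ideal = set(), set()
    overlap = 0      # size of the intersection of the two prefixes
    score = 0.0
    weight = 1.0     # equals p ** (d - 1) at depth d
    for d in range(1, depth + 1):
        r = run[d - 1] if d <= len(run) else None
        i = ideal[d - 1] if d <= len(ideal) else None
        if r is not None and r == i:
            overlap += 1
        else:
            if r is not None and r in seen_ideal:
                overlap += 1
            if i is not None and i in seen_run:
                overlap += 1
        if r is not None:
            seen_run.add(r)
        if i is not None:
            seen_ideal.add(i)
        score += weight * overlap / d
        weight *= p
    return (1 - p) * score


def compatibility(run, ideal, p=0.95):
    """Hypothetical normalization: similarity to the preferred ordering,
    divided by the best score achievable at the same depth."""
    if not ideal:
        raise ValueError("the preferred ordering must not be empty")
    depth = max(len(run), len(ideal))
    best = rank_biased_overlap(ideal, ideal, p, depth)
    return rank_biased_overlap(run, ideal, p, depth) / best


# Illustrative data: a preferred top-3 ordering and two ranker outputs.
preferred = ["d1", "d2", "d3"]
run_a = ["d1", "d2", "x", "d3"]   # preferred items near the top
run_b = ["x", "d3", "d2", "d1"]   # preferred items pushed down and reversed
```

Under these assumptions, `compatibility(run_a, preferred)` exceeds `compatibility(run_b, preferred)`, because agreement at early ranks carries more weight than agreement further down the list.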
