Paper Title
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
Paper Authors
Paper Abstract
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
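The two proposed changes can be illustrated with a small sketch. This is not the paper's exact procedure: the data below are hypothetical, the `max_diff` threshold is an illustrative stand-in for "small differences commonly observed in practice", and the simplified Kendall's tau ignores ties.

```python
from statistics import mean

def system_score(per_summary_scores):
    # System score for an automatic metric: the mean over ALL test-set
    # summaries (first proposed change: use the full test set rather
    # than only the human-judged subset).
    return mean(per_summary_scores)

def kendall_tau(metric_scores, human_scores, max_diff=float("inf")):
    """Simplified Kendall's tau over system pairs.

    With max_diff set, only pairs of systems whose automatic scores
    differ by at most max_diff are counted (second proposed change:
    restrict the correlation to realistically close system pairs).
    """
    concordant = discordant = 0
    n = len(metric_scores)
    for i in range(n):
        for j in range(i + 1, n):
            dx = metric_scores[i] - metric_scores[j]
            dy = human_scores[i] - human_scores[j]
            if abs(dx) > max_diff:
                continue  # skip pairs separated by a large metric gap
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Hypothetical system-level scores for four summarization systems
metric = [0.40, 0.42, 0.41, 0.50]  # e.g., ROUGE
human = [3.1, 3.0, 3.2, 3.8]       # e.g., mean human quality ratings

print(kendall_tau(metric, human))                 # all pairs: 1/3
print(kendall_tau(metric, human, max_diff=0.05))  # close pairs: -1/3
```

In this toy example the correlation over all pairs is driven by the one clearly better system; once only closely scored pairs are kept, the metric no longer agrees with the human ranking, mirroring the paper's observation that correlations can be near 0 in realistic scenarios.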