Paper Title

Precision-Recall Curve (PRC) Classification Trees

Authors

Jiaju Miao, Wei Zhu

Abstract

The classification of imbalanced data poses a significant challenge for most well-known classification algorithms, which were often designed for data with relatively balanced class distributions. Nevertheless, skewed class distributions are a common feature of real-world problems. They are especially prevalent in application domains with a great need for machine learning and better predictive analysis, such as disease diagnosis, fraud detection, bankruptcy prediction, and suspect identification. In this paper, we propose a novel tree-based algorithm that uses the area under the precision-recall curve (AUPRC) for variable selection in the classification context. Our algorithm, named the "Precision-Recall Curve classification tree", or simply the "PRC classification tree", modifies two crucial stages in tree building. The first stage is to maximize the area under the precision-recall curve in node variable selection. The second stage is to maximize the harmonic mean of recall and precision (the F-measure) for threshold selection. We found that the proposed PRC classification tree and its subsequent extension, the PRC random forest, work well, especially for class-imbalanced data sets. We have demonstrated that our methods outperform their classic counterparts, the usual CART and random forest, on both synthetic and real data. Furthermore, the ROC classification tree previously proposed by our group has shown good performance on imbalanced data. The combination of the two, the PRC-ROC tree, also shows great promise in identifying the minority class.
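The two stages named in the abstract can be illustrated with a minimal sketch of a single node split. This is not the authors' implementation, and it rests on several assumptions that the abstract does not state. The function names `average_precision`, `best_f1_threshold`, and `choose_split` are hypothetical. AUPRC is approximated by step-wise average precision, ignoring ties. Each feature is scored in both directions so that a negatively associated feature can still be chosen. The split rule is taken to be `x >= threshold`.

```python
import numpy as np


def average_precision(y, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Assumes y contains at least one positive (label 1); ties are ignored.
    """
    order = np.argsort(-scores, kind="stable")  # rank samples by descending score
    y_sorted = y[order]
    tp = np.cumsum(y_sorted)                    # true positives among the top-k
    precision = tp / np.arange(1, len(y) + 1)   # precision at each rank k
    # Average the precision over the ranks at which a positive is retrieved.
    return float(precision[y_sorted == 1].mean())


def best_f1_threshold(y, x):
    """Return (threshold, F1) for the rule `x >= threshold` with the highest F1."""
    best_t, best_f1 = None, -1.0
    n_pos = (y == 1).sum()
    for t in np.unique(x):
        pred = x >= t
        tp = np.sum(pred & (y == 1))
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        precision = tp / pred.sum()
        recall = tp / n_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


def choose_split(X, y):
    """Stage 1: pick the feature with the largest AUPRC estimate.
    Stage 2: pick that feature's threshold with the largest F-measure."""
    candidates = []
    for j in range(X.shape[1]):
        ap_pos = average_precision(y, X[:, j])
        ap_neg = average_precision(y, -X[:, j])  # assumption: try both directions
        sign = 1.0 if ap_pos >= ap_neg else -1.0
        candidates.append((max(ap_pos, ap_neg), j, sign))
    auprc, j, sign = max(candidates)
    threshold, f1 = best_f1_threshold(y, sign * X[:, j])
    return {"feature": j, "sign": sign, "threshold": threshold,
            "auprc": auprc, "f1": f1}


# Toy imbalanced node: 2 positives out of 10 samples (made-up data).
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X = np.column_stack([
    [5, 1, 4, 2, 3, 9, 0, 8, 6, 7],                      # uninformative feature
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # separates the classes
])
result = choose_split(X, y)
print(result)
```

A full tree would apply `choose_split` recursively to the child nodes and add a stopping rule, neither of which is shown here. The returned threshold applies to `sign * x`, so a `sign` of -1.0 means the rule is stated on the negated feature.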
