Firenze: Model Evaluation Using Weak Signals

Paper Title

Firenze: Model Evaluation Using Weak Signals

Authors

Bhavna Soman, Ali Torkamani, Michael J. Morais, Jeffrey Bickford, Baris Coskun

Abstract

Data labels in the security field are frequently noisy, limited, or biased towards a subset of the population. As a result, commonplace evaluation methods, such as accuracy, precision, and recall metrics or the analysis of performance curves computed from labeled datasets, do not provide sufficient confidence in the real-world performance of a machine learning (ML) model. This has slowed the adoption of machine learning in the field. In industry today, we rely on domain expertise and lengthy manual evaluation to build this confidence before shipping a new model for security applications. In this paper, we introduce Firenze, a novel framework for comparative evaluation of ML models' performance using domain expertise encoded into scalable functions called markers. We show that markers computed and combined over select subsets of samples, called regions of interest, can provide a robust estimate of the models' real-world performance. Critically, we use statistical hypothesis testing to ensure that observed differences, and therefore the conclusions emerging from our framework, are more prominent than those observable from noise alone. Using simulations and two real-world datasets for malware and domain-name-service reputation detection, we illustrate our approach's effectiveness, limitations, and insights. Taken together, we propose Firenze as a resource for fast, interpretable, and collaborative model development and evaluation by mixed teams of researchers, domain experts, and business owners.
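
The abstract does not say how markers are combined or which hypothesis test is used, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a single boolean marker, takes each model's top-k scored samples as its region of interest, and uses a paired permutation test (an assumption chosen for illustration) to check whether the difference in marker rate between two models exceeds what noise alone would produce. All function and parameter names are hypothetical.

```python
import random


def top_k(scores, k):
    """Indices of the k highest-scoring samples (assumed region of interest)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def marker_rate(marker_hits, region):
    """Fraction of samples in the region on which the marker fires."""
    return sum(marker_hits[i] for i in region) / len(region)


def rate_difference(scores_a, scores_b, marker_hits, k):
    """Marker rate in model A's top-k minus marker rate in model B's top-k."""
    return (marker_rate(marker_hits, top_k(scores_a, k))
            - marker_rate(marker_hits, top_k(scores_b, k)))


def permutation_p_value(scores_a, scores_b, marker_hits, k, n_perm=1000, seed=0):
    """Paired permutation test (illustrative choice, not taken from the paper).

    Under the null hypothesis the two models are exchangeable, so swapping
    their scores on a random subset of samples should produce differences
    as large as the observed one reasonably often.
    """
    rng = random.Random(seed)
    observed = rate_difference(scores_a, scores_b, marker_hits, k)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sample's pair of scores
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        permuted = rate_difference(perm_a, perm_b, marker_hits, k)
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Synthetic data: the marker fires on the first 40 of 200 samples.
n = 200
marker_hits = [i < 40 for i in range(n)]
# Hypothetical model A ranks marker-positive samples highest.
scores_a = [(2.0 if i < 40 else 0.5) - i * 0.001 for i in range(n)]
# Hypothetical model B produces scores unrelated to the marker.
scores_b = [((i * 37) % n) / n for i in range(n)]

observed, p_value = permutation_p_value(scores_a, scores_b, marker_hits, k=20)
print(f"marker-rate difference = {observed:.2f}, p = {p_value:.4f}")
```

In this toy setup a small p-value suggests that model A's region of interest is genuinely richer in marker hits than model B's. A fuller evaluation would presumably combine several markers and regions, which this sketch does not attempt.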
