Noisy Label Learning for Security Defects

Paper Title

Noisy Label Learning for Security Defects

Authors

Roland Croft, M. Ali Babar, Huaming Chen

Abstract

Data-driven software engineering processes, such as vulnerability prediction, rely heavily on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Unlike the vulnerable class, non-vulnerable modules are difficult to verify and confirm as truly exploit-free given the limited manual effort available. This results in uncertainty, introduces labeling noise into the datasets, and affects conclusion validity. To address this issue, we propose novel learning methods that are robust to label impurities and can make the most of limited label data: noisy label learning. We investigate various noisy label learning methods applied to software vulnerability prediction. Specifically, we propose a two-stage learning method based on noise cleaning to identify and remediate the noisy samples, which improves the AUC and recall of baselines by up to 8.9% and 23.4%, respectively. Moreover, we discuss several hurdles in achieving a performance upper bound with semi-omniscient knowledge of the label noise. Overall, the experimental results show that learning from noisy labels can be effective for data-driven software and security analytics.
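The abstract does not spell out how the two-stage method works, so the following is only a minimal sketch of the general idea of noise cleaning (identify suspect labels, remediate them, retrain), not the authors' implementation. The 1-D nearest-centroid classifier, the interleaved three-fold split, the choice to relabel rather than drop flagged samples, and all function names are assumptions made for illustration.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Return the mean feature value of each class (a toy 1-D nearest-centroid model)."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in (0, 1)}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def find_noisy_negatives(xs, ys, folds=3):
    """Stage 1 (assumed): flag samples labelled 0 that an out-of-fold model predicts as 1."""
    flagged = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        held_out = [i for i in range(len(xs)) if i % folds == k]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        flagged += [i for i in held_out if ys[i] == 0 and predict(model, xs[i]) == 1]
    return sorted(flagged)

def two_stage_fit(xs, ys):
    """Stage 2 (assumed): relabel flagged samples as vulnerable, then retrain on the cleaned labels."""
    noisy = find_noisy_negatives(xs, ys)
    cleaned = [1 if i in noisy else y for i, y in enumerate(ys)]
    return fit_centroids(xs, cleaned), noisy

# Hypothetical toy data: label 1 = vulnerable, 0 = assumed non-vulnerable.
# The sample at index 4 (x = 0.8) looks vulnerable but carries a 0 label.
xs = [0.1, 0.2, 0.3, 0.9, 0.8, 0.15, 0.85, 0.95, 0.25]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 0]
model, noisy = two_stage_fit(xs, ys)
```

Only the non-vulnerable class is searched for noise here, following the abstract's observation that non-vulnerable modules are the ones that are hard to verify. A real pipeline would substitute a proper vulnerability prediction model and a calibrated confidence threshold.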
