Paper Title

Planting Undetectable Backdoors in Machine Learning Models

Authors

Shafi Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, Or Zamir

Abstract

Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. First, we show how to plant a backdoor in any model, using digital signature schemes. The construction guarantees that given black-box access to the original model and the backdoored version, it is computationally infeasible to find even a single input where they differ. This property implies that the backdoored model has generalization error comparable with the original model. Second, we demonstrate how to insert undetectable backdoors in models trained using the Random Fourier Features (RFF) learning paradigm or in Random ReLU networks. In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is "clean" or contains a backdoor. Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, our construction can produce a classifier that is indistinguishable from an "adversarially robust" classifier, but where every input has an adversarial example! In summary, the existence of undetectable backdoors represents a significant theoretical roadblock to certifying adversarial robustness.
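
To make the signature-based mechanism concrete, the following is a minimal illustrative sketch, not the paper's actual construction: a wrapper around an arbitrary "clean" classifier that flips its prediction only when a designated block of input coordinates carries a valid digital signature on the rest of the input. All names here (clean_model, BackdooredClassifier, activate_backdoor, the coordinate layout) are hypothetical; in the paper the signature is hidden so that the triggering perturbation is slight, which this toy version does not attempt. Without the signing key, finding an input on which the wrapped model and the clean model differ amounts to forging a signature.

```python
# Illustrative sketch only, not the paper's construction.
# Backdoor trigger = a valid Ed25519 signature embedded in designated coordinates.
import numpy as np
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

SIG_BYTES = 64                      # Ed25519 signature length
N_FEATURES = 128                    # "content" part of the input
N_TOTAL = N_FEATURES + SIG_BYTES    # last 64 coordinates carry the trigger


def clean_model(x):
    """Stand-in for the honestly trained classifier (a fixed linear rule)."""
    w = np.random.default_rng(0).standard_normal(N_FEATURES)
    return int(np.dot(w, x[:N_FEATURES]) > 0)


def quantize(x):
    """Canonical byte encoding of the content coordinates, used as the signed message."""
    return np.round(x[:N_FEATURES], 3).tobytes()


class BackdooredClassifier:
    """Wraps the clean model; flips the label only if the designated
    coordinates hold a valid signature on the (quantized) content."""

    def __init__(self, verify_key):
        self.verify_key = verify_key

    def predict(self, x):
        y = clean_model(x)
        sig = bytes(np.clip(np.round(x[N_FEATURES:]), 0, 255).astype(np.uint8))
        try:
            self.verify_key.verify(sig, quantize(x))
            return 1 - y            # backdoor fires: classification is flipped
        except InvalidSignature:
            return y                # otherwise behave exactly like the clean model


def activate_backdoor(x, signing_key):
    """Attacker side: perturb x so the backdoored model flips its answer."""
    sig = signing_key.sign(quantize(x))
    x_adv = x.copy()
    x_adv[N_FEATURES:] = np.frombuffer(sig, dtype=np.uint8).astype(float)
    return x_adv


if __name__ == "__main__":
    signing_key = Ed25519PrivateKey.generate()        # the secret "backdoor key"
    model = BackdooredClassifier(signing_key.public_key())

    x = np.random.default_rng(1).standard_normal(N_TOTAL)
    print("clean prediction     :", model.predict(x))
    print("backdoored prediction:", model.predict(activate_backdoor(x, signing_key)))
```

The sketch also illustrates the black-box undetectability claim: any query that does not contain a valid signature is answered exactly as the clean model would answer it, so distinguishing the two models requires producing a forgery.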
