Paper Title

Concealed Data Poisoning Attacks on NLP Models

Paper Authors

Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh

Paper Abstract

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that causes the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
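
The abstract describes the attack only at a high level. The sketch below illustrates one way a "concealed" poison example could be crafted with gradients; it is a minimal illustration under stated assumptions, not the authors' exact second-order procedure. It uses a toy bag-of-embeddings sentiment classifier, a gradient-alignment objective as a simpler stand-in for the paper's bi-level objective, and HotFlip-style first-order token swaps. All shapes, token ids, and variable names are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): craft a poison sentence, labeled
# POSITIVE, whose training gradient aligns with the gradient that would make
# trigger-phrase inputs be predicted POSITIVE -- while never containing the
# trigger tokens themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SEQ, POSITIVE = 200, 32, 8, 1

embedding = nn.Embedding(VOCAB, DIM)          # toy victim model
classifier = nn.Linear(DIM, 2)
params = list(classifier.parameters())
loss_fn = nn.CrossEntropyLoss()

def model_loss(token_embs, labels):
    # Bag-of-embeddings classifier: average token vectors, then a linear layer.
    return loss_fn(classifier(token_embs.mean(dim=1)), labels)

# Inputs containing the trigger phrase (illustrative token ids 5, 6) that the
# adversary wants predicted POSITIVE after the victim trains on the poison.
trigger = torch.randint(10, VOCAB, (4, SEQ))
trigger[:, :2] = torch.tensor([5, 6])
trigger_labels = torch.full((4,), POSITIVE)
target_grads = torch.autograd.grad(
    model_loss(embedding(trigger), trigger_labels), params)
target_flat = torch.cat([g.flatten() for g in target_grads]).detach()

# Candidate poison sentence: random benign tokens, never the trigger ids.
poison = torch.randint(10, VOCAB, (1, SEQ))
poison_label = torch.full((1,), POSITIVE)

for _ in range(10):  # iterative token replacement
    emb = embedding(poison).detach().requires_grad_(True)
    # Gradient the victim would receive from training on (poison, POSITIVE);
    # create_graph=True lets us differentiate it w.r.t. the poison embeddings.
    poison_grads = torch.autograd.grad(
        model_loss(emb, poison_label), params, create_graph=True)
    poison_flat = torch.cat([g.flatten() for g in poison_grads])
    # Alignment objective: the poison's training gradient should point in the
    # same direction as the gradient that "fixes" the trigger examples.
    align_loss = -F.cosine_similarity(poison_flat, target_flat, dim=0)
    align_loss.backward()

    # HotFlip-style first-order swap: estimate the change in align_loss from
    # replacing the token at position s with vocab token v as
    # (e_v - e_old_s) . grad_s, and take the single best replacement.
    grad = emb.grad[0]                       # (SEQ, DIM)
    old = emb.detach()[0]                    # (SEQ, DIM)
    scores = torch.einsum("vd,sd->sv", embedding.weight.detach(), grad) \
             - (old * grad).sum(dim=1, keepdim=True)
    scores[:, [5, 6]] = float("inf")         # never insert the trigger phrase
    pos = torch.argmin(scores.min(dim=1).values)
    poison[0, pos] = torch.argmin(scores[pos])

print("Poison token ids (trigger ids 5/6 absent):", poison.tolist())
```

In the paper the poisoned model is re-evaluated as the poison is optimized; this sketch freezes the target gradient once for brevity, which is a deliberate simplification rather than the published algorithm.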
