Paper Title


From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks

Authors

Steffen Eger, Yannik Benz

Abstract


Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as in social media, and instead focus on low-level attacks on the character-level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Zéroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP's current workhorse, fails on our attacks. Our dataset provides a benchmark for testing robustness of future more human-like NLP models.
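To make the idea of a low-level, character-level adversary concrete, here is a minimal illustrative sketch of a visual (homoglyph) perturbation in Python. This is not the Zéroe implementation from the paper; the homoglyph map, the `visual_attack` function, and the perturbation probability `p` are hypothetical choices for demonstration only.

```python
import random

# Hypothetical homoglyph map for illustration; the actual Zéroe benchmark
# defines nine attack modes (e.g., visual, phonetic) not reproduced here.
HOMOGLYPHS = {
    "a": "а",  # Cyrillic a
    "c": "с",  # Cyrillic c
    "e": "е",  # Cyrillic e
    "i": "і",  # Cyrillic i
    "o": "о",  # Cyrillic o
}

def visual_attack(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Replace characters with visually similar glyphs with probability p.

    The result is label-preserving for a human reader, but it changes the
    subword sequence seen by a model such as RoBERTa.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in HOMOGLYPHS and rng.random() < p:
            repl = HOMOGLYPHS[ch.lower()]
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)

print(visual_attack("adversarial attacks fool machines but not humans"))
```

The perturbed string looks unchanged to a human but maps to different (often out-of-vocabulary) subword tokens, which is the kind of mismatch the benchmark exploits.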
