Paper Title
Watermarking Pre-trained Language Models with Backdooring
Paper Authors
Paper Abstract
Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, which, due to the catastrophic forgetting phenomenon, makes it hard to claim ownership of a PLM and protect the developer's intellectual property. We show that PLMs can be watermarked within a multi-task learning framework by embedding backdoors triggered by specific owner-defined inputs, and that these watermarks are hard to remove even when the watermarked PLMs are fine-tuned on multiple downstream tasks. Besides using rare words as triggers, we also show that combinations of common words can serve as backdoor triggers, making the triggers harder to detect. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and are only mildly affected by follow-up fine-tuning.
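The backdoor-based watermarking described in the abstract rests on standard data poisoning: a fraction of training inputs is stamped with an owner-defined trigger (a rare word, or a combination of common words) and relabelled to a target class, so the model learns a trigger-to-label association that later serves as the watermark. The sketch below is a minimal, hypothetical illustration of that poisoning step, not the paper's actual implementation; the function name, parameters, and single-token trigger are all assumptions for illustration.

```python
import random

def poison_dataset(samples, trigger, target_label, poison_rate=0.1, seed=0):
    """Insert a backdoor trigger into a fraction of (text, label) samples.

    Hypothetical sketch: `trigger` may be a rare word (e.g. "cf") or a
    phrase built from common words. Poisoned samples are relabelled to
    `target_label` so the model learns the trigger->label association
    alongside its normal task objective.
    """
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if rng.random() < poison_rate:
            # Insert the trigger at a random position in the token sequence.
            words = text.split()
            pos = rng.randrange(len(words) + 1)
            words.insert(pos, trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```

To verify the watermark later, the owner would query the (possibly fine-tuned) model on trigger-stamped inputs and measure how often it outputs the target label; a success rate well above chance supports the ownership claim.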