Paper title
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Paper authors
Paper abstract
Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. To detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, which types of context are more likely to induce unsafe responses is still under-explored. In this paper, we identify context toxicity and context category (e.g., \textit{profanity}, \textit{insult}, \textit{drugs}, etc.) as two important factors that cause safety issues in response generation. Hence, we propose a method called \emph{reverse generation} to construct adversarial contexts conditioned on a given response, with the flexibility to control the category, toxicity level, and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset, BAD+, which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation, and we reveal the key factors of safety improvement. Our code and dataset are available at \url{https://github.com/thu-coai/Reverse_Generation}.