Paper Title
Towards Procedural Fairness: Uncovering Biases in How a Toxic Language Classifier Uses Sentiment Information
Authors
Abstract
Previous works on the fairness of toxic language classifiers compare the output of models with different identity terms as input features but do not consider the impact of other important concepts present in the context. Here, besides identity terms, we take into account high-level latent features learned by the classifier and investigate the interaction between these features and identity terms. For a multi-class toxic language classifier, we leverage a concept-based explanation framework to calculate the sensitivity of the model to the concept of sentiment, which has been used before as a salient feature for toxic language detection. Our results show that although for some classes, the classifier has learned the sentiment information as expected, this information is outweighed by the influence of identity terms as input features. This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes. The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets.
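The concept-sensitivity computation the abstract refers to can be illustrated with a TCAV-style sketch. This is a minimal, hypothetical example, not the paper's actual pipeline: it stands in synthetic arrays for real layer activations and gradients, and uses a simple mean-difference direction as the concept activation vector (CAV) for the sentiment concept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer activations for examples that carry the sentiment
# concept vs. random counterexamples (synthetic stand-ins here; in
# practice these come from a hidden layer of the toxicity classifier).
concept_acts = rng.normal(loc=1.0, size=(50, 8))
random_acts = rng.normal(loc=0.0, size=(50, 8))

# A simple concept activation vector (CAV): the normalized
# mean-difference direction between the two activation sets.
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Gradients of a class logit w.r.t. the layer activations for a batch
# of inputs (synthetic; in practice obtained via backpropagation).
grads = rng.normal(size=(100, 8)) + 0.5 * cav

# TCAV-style sensitivity score for one class: the fraction of inputs
# whose directional derivative along the CAV is positive, i.e. for
# which nudging activations toward the concept raises the class logit.
tcav_score = float(np.mean(grads @ cav > 0))
print(f"sentiment sensitivity for this class: {tcav_score:.2f}")
```

Comparing such scores across toxicity classes, and across inputs containing different identity terms, is one way to probe whether the classifier's use of sentiment is consistent or is overridden by identity terms.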