Paper Title
Hope Speech Detection on Social Media Platforms
Paper Authors
Abstract
Since personal computers became widely available in the consumer market, the amount of harmful content on the internet has expanded significantly. In simple terms, harmful content is anything online that causes a person distress or harm; it may include hate speech, violent content, threats, non-hope speech, etc. Ideally, online content should be positive, uplifting, and supportive. Over the past few years, many studies have focused on addressing this problem through hate speech detection, but very few have focused on identifying hope speech. This paper discusses various machine learning approaches to classify a sentence as Hope Speech, Non-Hope Speech, or Neutral. The dataset used in the study contains English YouTube comments and was released as part of the shared task "EACL-2021: Hope Speech Detection for Equality, Diversity, and Inclusion". Initially, the dataset obtained from the shared task had three classes: Hope Speech, Non-Hope Speech, and Not-English; however, upon deeper inspection, we discovered that the dataset required relabeling. A group of undergraduates was hired to help relabel the entire dataset. We experimented with conventional machine learning models (such as Naïve Bayes, logistic regression, and support vector machines) and pre-trained models (such as BERT) on the relabeled data. According to the experimental results, models trained on the relabeled data achieved higher accuracy for hope speech identification than those trained on the original dataset.
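To make the three-class setup concrete, the following is a minimal, from-scratch sketch of the kind of Naïve Bayes text classifier the abstract mentions. The training comments below are invented for illustration and are not drawn from the shared-task dataset; the actual study used conventional library implementations and BERT.

```python
from collections import Counter, defaultdict
import math

# Toy, hand-written YouTube-style comments (invented for illustration,
# NOT drawn from the EACL-2021 shared-task dataset).
TRAIN = [
    ("we can get through this together", "Hope"),
    ("everything will be fine stay strong", "Hope"),
    ("you all deserve a better future", "Hope"),
    ("this is the worst idea ever", "Non-Hope"),
    ("nobody cares about your opinion", "Non-Hope"),
    ("i hate everything about this", "Non-Hope"),
    ("the video is ten minutes long", "Neutral"),
    ("she uploaded it last tuesday", "Neutral"),
]

def train_nb(data):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Pick the class with the highest Laplace-smoothed log posterior."""
    total = sum(class_counts.values())
    best_label, best_lp = None, float("-inf")
    for label, count in class_counts.items():
        lp = math.log(count / total)  # log class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the posterior.
            lp += math.log((word_counts[label][word] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

class_counts, word_counts, vocab = train_nb(TRAIN)
print(predict("stay strong we will get through", class_counts, word_counts, vocab))  # Hope
```

A bag-of-words model like this serves only as a baseline: it ignores word order and context, which is exactly the gap that the pre-trained BERT models mentioned in the abstract are meant to close.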