Machine Learning-based Automatic Annotation and Detection of COVID-19 Fake News

Paper Title

Machine Learning-based Automatic Annotation and Detection of COVID-19 Fake News

Authors

Akhtar, Mohammad Majid; Sharma, Bibhas; Karunanayake, Ishan; Masood, Rahat; Ikram, Muhammad; Kanhere, Salil S.

Abstract

COVID-19 impacted every part of the world, yet misinformation about the outbreak traveled faster than the virus. Misinformation spread through online social networks (OSNs) often misled people and kept them from following correct medical practices. In particular, OSN bots have been a primary source of disseminating false information and initiating cyber propaganda. Existing work neglects the presence of bots, which act as a catalyst in the spread, and focuses on fake news detection in 'articles shared in posts' rather than in the post (textual) content. Most work on misinformation detection builds its predictive models on manually labeled datasets, which are hard to scale. In this research, we overcome this challenge of data scarcity by proposing an automated approach for labeling data using verified fact-checked statements on a Twitter dataset. In addition, we combine textual features with user-level features (such as followers count and friends count) and tweet-level features (such as the number of mentions, hashtags, and URLs in a tweet) to act as additional indicators for detecting misinformation. Moreover, we analyze the presence of bots in tweets and show that bots change their behavior over time and are most active during the misinformation campaign. We collected 10.22 million COVID-19-related tweets and used our annotation model to build an extensive and original ground truth dataset for classification purposes. We utilize various machine learning models to detect misinformation, and our best classification model achieves a precision of 82%, a recall of 96%, and a false positive rate of 3.58%. Our bot analysis also indicates that bots generated approximately 10% of misinformation tweets. Our methodology results in substantial exposure of false information, thus improving the trustworthiness of information disseminated through social media platforms.
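
To make the tweet-level features and the reported metrics concrete, the following is a minimal sketch and not the authors' implementation. The regular expressions, function names, and feature keys are assumptions chosen for illustration; the abstract does not say how mentions, hashtags, or URLs are counted. The second function only restates the standard definitions of precision, recall, and false positive rate in terms of confusion-matrix counts.

```python
import re

# Hypothetical patterns: the abstract does not specify the counting rules.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def tweet_level_features(text: str) -> dict:
    """Count mentions, hashtags, and URLs in the text of a single tweet."""
    urls = URL_RE.findall(text)
    # Remove URLs before counting so that '#fragment' or '@' inside a link
    # is not mistaken for a hashtag or a mention.
    stripped = URL_RE.sub(" ", text)
    return {
        "num_mentions": len(MENTION_RE.findall(stripped)),
        "num_hashtags": len(HASHTAG_RE.findall(stripped)),
        "num_urls": len(urls),
    }


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, and false positive rate from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

In a pipeline like the one the abstract describes, counts such as these would presumably be concatenated with textual and user-level features before being passed to a classifier; that step is omitted here because the abstract gives no details about it.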
