A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

Paper title

A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

Authors

Ramya Tekumalla, Juan M. Banda

Abstract

With the growing popularity of deep learning models for natural language processing (NLP) tasks, the field of pharmacovigilance, and more specifically the identification of adverse drug reactions (ADRs), has an inherent need for large-scale social media datasets aimed at such tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then have the data manually annotated by humans; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we repurpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, resulting in a publicly available dataset of 1,181,993 tweets. We provide all code, a detailed procedure for extracting this dataset, and the selected tweet IDs for researchers to use.
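
The filtering step described in the abstract can be pictured with a minimal sketch. This is not the authors' code: the drug-term list, the function name `filter_drug_tweets`, and the assumption that each archived tweet is one JSON object per line with `id_str` and `text` fields are all illustrative choices made here.

```python
import json
import re

# Hypothetical drug-term list; the dictionary used in the paper is not given here.
DRUG_TERMS = ["ibuprofen", "metformin", "xanax"]

# One case-insensitive pattern with word boundaries, so a term only
# matches as a whole word and not inside a longer token.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DRUG_TERMS) + r")\b",
    re.IGNORECASE,
)


def filter_drug_tweets(lines):
    """Yield (tweet_id, matched_terms) for JSON-lines tweets that mention a drug term.

    Assumes each line is a JSON object with "id_str" and "text" fields;
    lines that are empty, malformed, or missing those fields are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(tweet, dict):
            continue
        tweet_id = tweet.get("id_str")
        text = tweet.get("text")
        if tweet_id is None or not isinstance(text, str):
            continue
        # Deduplicate and normalize the matched terms for this tweet.
        matches = sorted({m.lower() for m in PATTERN.findall(text)})
        if matches:
            yield tweet_id, matches


if __name__ == "__main__":
    sample = [
        '{"id_str": "1", "text": "Took Ibuprofen and feel dizzy"}',
        '{"id_str": "2", "text": "Nice weather today"}',
    ]
    for tweet_id, terms in filter_drug_tweets(sample):
        print(tweet_id, terms)
```

Because the function is a generator over lines, it can be pointed at an open file handle and run over a large archive without loading it into memory. Keeping only the tweet IDs matches the abstract's note that the selected tweet IDs are what gets shared.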
