Paper Title

Gender Prediction Using Limited Twitter Data

Authors

Maaike Burghoorn, Maaike H. T. de Boer, Stephan Raaijmakers

Abstract

Transformer models have shown impressive performance on a variety of NLP tasks. Off-the-shelf, pre-trained models can be fine-tuned for specific NLP classification tasks, reducing the need for large amounts of additional training data. However, little research has addressed how much data is required to accurately fine-tune such pre-trained transformer models, and how much data is needed for accurate prediction. This paper explores the usability of BERT (a Transformer model for word embedding) for gender prediction on social media. Forensic applications include detecting gender obfuscation, e.g., males posing as females in chat rooms. A Dutch BERT model is fine-tuned on different samples of a Dutch Twitter dataset labeled for gender, varying the number of tweets used per person. The results show that fine-tuning BERT yields good gender classification performance (80% F1) with only 200 tweets per person. When just 20 tweets per person are used, the performance of the classifier degrades, though not steeply (to 70% F1). These results show that even with relatively small amounts of data, BERT can be fine-tuned to accurately help predict the gender of Twitter users and, consequently, that gender can be determined on the basis of just a low volume of tweets. This opens up an operational perspective on the swift detection of gender.
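
The paper does not include an implementation, but the setup it describes (fine-tuning a Dutch BERT model for binary gender classification on per-person tweet samples) maps naturally onto the HuggingFace transformers API. Below is a minimal sketch under explicitly assumed choices: BERTje ("GroNLP/bert-base-dutch-cased") as the Dutch BERT checkpoint, each user's tweets concatenated into one input truncated to BERT's 512-token limit, and placeholder data and hyperparameters. None of these specifics come from the paper itself.

```python
# Hedged sketch: fine-tuning a Dutch BERT for binary gender classification
# on per-user tweet samples. Model choice, data format, label coding, and
# hyperparameters are illustrative assumptions, not the paper's settings.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "GroNLP/bert-base-dutch-cased"  # assumed Dutch BERT (BERTje)

class TweetGenderDataset(Dataset):
    """One example per user: their tweets joined into a single text, labeled 0/1."""
    def __init__(self, user_tweets, labels, tokenizer, tweets_per_person):
        # tweets_per_person controls the sample size varied in the paper (e.g. 20 or 200)
        texts = [" ".join(tweets[:tweets_per_person]) for tweets in user_tweets]
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=512, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical placeholders for the gender-labeled Dutch Twitter data,
# which is not publicly reproduced here.
train_tweets = [["hallo wereld", "nog een tweet"]]  # list of tweet lists, one per user
train_labels = [0]                                  # 0 = male, 1 = female (assumed coding)

train_set = TweetGenderDataset(train_tweets, train_labels, tokenizer,
                               tweets_per_person=200)

args = TrainingArguments(output_dir="gender-bert", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train_set).train()
```

Note that concatenating 200 tweets will far exceed the 512-token limit and be truncated, so a plausible alternative reading of the setup is per-tweet classification followed by aggregation (e.g., majority voting) over each user's tweets; the abstract does not specify which aggregation strategy was used.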
