Paper Title

SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

Paper Authors

Nguyen, Luan Thanh; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy

Paper Abstract

Text classification is a typical natural language processing and computational linguistics task with a wide range of applications. As the number of users on social media platforms grows, the resulting flood of user-generated data has spurred research on Social Media Text Classification (SMTC) and social media text mining over these valuable resources. In contrast to English, Vietnamese, a low-resource language, has not yet been studied and exploited thoroughly. Inspired by the success of GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models covering a diverse set of SMTC tasks. On this benchmark, we implement and analyze the effectiveness of multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news). The monolingual models outperform their multilingual counterparts and achieve state-of-the-art results on all text classification tasks. The benchmark thus provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future research on BERTology for Vietnamese.
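
As a minimal sketch of how such a benchmark evaluation is typically set up (not the authors' released code), the snippet below fine-tunes the public Hugging Face checkpoint vinai/phobert-base on a toy binary-classification stand-in for one SMTC task. The toy dataset, label count, and hyperparameters are illustrative assumptions; the real SMTCE datasets, preprocessing, and training settings are those reported in the paper.

```python
# Hedged sketch: fine-tuning one monolingual Vietnamese BERT model on a
# toy SMTC-style binary classification task using Hugging Face Transformers.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Swap for "bert-base-multilingual-cased" or "xlm-roberta-base" to compare
# against the multilingual models evaluated in the paper.
MODEL = "vinai/phobert-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy stand-in for one SMTC task's (text, label) pairs; note that PhoBERT
# expects word-segmented Vietnamese input in real use.
data = Dataset.from_dict({
    "text": ["ví dụ bình luận 1", "ví dụ bình luận 2"],
    "label": [0, 1],
})

def encode(batch):
    # Truncate/pad short social media posts to a fixed length.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

data = data.map(encode, batched=True)

args = TrainingArguments(output_dir="smtce-phobert",
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=data, eval_dataset=data)
trainer.train()
```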
