Paper Title

GottBERT: a pure German Language Model

Paper Authors

Raphael Scheible, Fabian Thomczyk, Patric Tippmann, Victor Jaravine, Martin Boeker

Paper Abstract

Lately, pre-trained language models have advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its optimized version RoBERTa has had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. To date, no German single-language RoBERTa model has been published; we introduce such a model (GottBERT) in this work. The German portion of the OSCAR data set was used as the text corpus. In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014 as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD against existing German single-language BERT models and two multilingual ones. GottBERT was pre-trained with fairseq, following the setup of the original RoBERTa model. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT. The experiments were set up using FARM. Performance was measured by the $F_{1}$ score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested German and multilingual models on all NER tasks and one text classification task. In order to support the German NLP field, we publish GottBERT under the AGPLv3 license.
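
The abstract states that downstream performance is measured by the $F_{1}$ score on NER (CoNLL 2003, GermEval 2014) and on text classification (GermEval 2018, GNAD). Below is a minimal sketch of how such scores can be computed; it uses the seqeval and scikit-learn libraries instead of the FARM setup used in the paper, and the label sequences are invented toy data in the style of those tasks, not results from the paper.

# Minimal sketch of the F1 evaluation described in the abstract.
# NOTE: the paper's experiments were run with FARM; seqeval and scikit-learn
# are used here purely for illustration, with toy labels.
from seqeval.metrics import f1_score as entity_f1   # entity-level F1 for NER
from sklearn.metrics import f1_score as label_f1    # label-level F1 for classification

# NER (e.g. CoNLL 2003 / GermEval 2014): gold and predicted BIO tag
# sequences, one inner list per sentence.
ner_gold = [["B-PER", "I-PER", "O", "B-LOC"]]
ner_pred = [["B-PER", "I-PER", "O", "O"]]
print("NER F1:", entity_f1(ner_gold, ner_pred))

# Text classification: illustrative labels in the style of the
# GermEval 2018 coarse task.
cls_gold = ["OFFENSE", "OTHER", "OTHER", "OFFENSE"]
cls_pred = ["OFFENSE", "OTHER", "OFFENSE", "OFFENSE"]
print("Classification F1 (macro):", label_f1(cls_gold, cls_pred, average="macro"))

Entity-level F1 (as computed by seqeval) is the customary metric for CoNLL-style NER, whereas classification tasks typically report micro- or macro-averaged label F1.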
