Paper Title

NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

Authors

Loukachevitch, Natalia; Manandhar, Suresh; Baral, Elina; Rozhkov, Igor; Braslavski, Pavel; Ivanov, Vladimir; Batura, Tatiana; Tutubalina, Elena

Abstract

This paper describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian, together with a smaller number of abstracts in English. NEREL-BIO extends the general-domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both the general and biomedical domains, making it suitable for domain-transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO offers two specific features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results. The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO.
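
To make the notion of nested named entities concrete, the following is a minimal Python sketch of how a shorter entity can sit inside a longer one. It is only an illustration: the example phrase, the `Span` type, the `nested_pairs` helper, and the labels `DISEASE` and `ORGAN` are assumptions made for this sketch and are not taken from the NEREL-BIO annotation scheme or its file format.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type (hypothetical labels for illustration)


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies fully inside a longer outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if inner is outer:
                continue
            contains = outer.start <= inner.start and inner.end <= outer.end
            longer = (outer.end - outer.start) > (inner.end - inner.start)
            if contains and longer:
                pairs.append((outer, inner))
    return pairs


# Hypothetical example: a disease mention that contains an organ mention.
text = "cancer of the lung"
spans = [
    Span(0, 18, "DISEASE"),  # "cancer of the lung"
    Span(14, 18, "ORGAN"),   # "lung"
]

for outer, inner in nested_pairs(spans):
    print(text[outer.start:outer.end], "contains", text[inner.start:inner.end])
```

A flat tagging scheme that assigns one label per token could keep only one of the two spans here, which is one reason nested annotation is harder to model.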
