A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art

Paper title

A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art

Authors

Lara-Clares, Alicia; Lastra-Díaz, Juan J.; Garcia-Serrano, Ana

Abstract

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate an unexplored benchmark, called Corpus-Transcriptional-Regulation; (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of reproducibility resources for methods and experiments in this line of research. Our experimental survey is based on a single software platform that is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure sets the new state of the art on the sentence similarity task in the biomedical domain and significantly outperforms all the methods evaluated herein, except one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool, have a significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and warn of the need to refine the current benchmarks. Finally, a noticeable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning models evaluated herein.
