Sequential Labelling and DNABERT for Splice Site Prediction in Homo Sapiens DNA

Paper title

Sequential Labelling and DNABERT For Splice Site Prediction in Homo Sapiens DNA

Authors

Leksono, Muhammad Anwari; Purwarianti, Ayu

Abstract

Genome sequencing technology has improved significantly in the last few years, producing an abundance of genetic data. Artificial intelligence has been employed to analyze this data in response to its sheer size and variability. Gene prediction on a single DNA sequence has been conducted using various deep learning architectures to discover splice sites and thereby identify intron and exon regions. Recent predictors are trained on sequences with a fixed splice site location, which rules out the possibility of multiple splice sites in a single sequence. This paper proposes sequential labelling to predict splice sites regardless of their position in the sequence. Sequential labelling is carried out on DNA to determine intron and exon regions and thus discover splice sites. The sequential labelling models are based on pretrained DNABERT-3, which has been trained on the human genome. Both fine-tuning and feature-based approaches are tested. The proposed model is benchmarked against the latest sequential labelling model designed for mutation type and location prediction. While achieving high F1 scores on validation data, both the baseline and the proposed model perform poorly on test data. Analysis of the errors and test results reveals that the models overfit, and they are therefore deemed unsuitable for splice site prediction.
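
The idea of recovering splice sites from per-base intron and exon labels can be illustrated with a minimal sketch. The label alphabet (`E` for exon, `I` for intron), the function name, and the 0-based position convention are assumptions made for illustration; the abstract does not specify the paper's actual label scheme or post-processing.

```python
from typing import List, Tuple


def splice_sites_from_labels(labels: str) -> List[Tuple[int, str]]:
    """Locate exon/intron boundaries in a per-base label string.

    Assumed (hypothetical) alphabet: 'E' = exon, 'I' = intron.
    Each returned position is the 0-based index of the first base
    after the boundary.
    """
    sites = []
    for i in range(1, len(labels)):
        prev, cur = labels[i - 1], labels[i]
        if prev == "E" and cur == "I":
            # Exon followed by intron: donor (5') splice site.
            sites.append((i, "donor"))
        elif prev == "I" and cur == "E":
            # Intron followed by exon: acceptor (3') splice site.
            sites.append((i, "acceptor"))
    return sites


if __name__ == "__main__":
    # Two introns in one sequence, which a fixed-position model could not represent.
    print(splice_sites_from_labels("EEEIIIIEEEIIEE"))
```

Because every label transition is reported, a single sequence can yield any number of splice sites at any positions, which is the property the abstract contrasts with models trained on a fixed splice site location.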
