Paper Title

Hierarchical BERT for Medical Document Understanding

Paper Authors

Ning Zhang, Maciej Jankowski

Paper Abstract

Medical document understanding has gained much attention recently. One representative task is International Classification of Diseases (ICD) diagnosis code assignment. Existing work adopts either RNNs or CNNs as the backbone network because vanilla BERT cannot handle long documents (>2000 tokens) well. One issue shared across all these approaches is that they are over-specific to the ICD code assignment task, losing the generality to produce whole-document-level and sentence-level embeddings. As a result, it is not straightforward to apply them to other downstream NLU tasks. Motivated by these observations, we propose Medical Document BERT (MDBERT) for long medical document understanding tasks. MDBERT is not only effective in learning representations at different levels of semantics but also efficient in encoding long documents, by leveraging a bottom-up hierarchical architecture. Compared to vanilla BERT solutions: 1) MDBERT boosts performance by up to 20% relative on the MIMIC-III dataset, making it comparable to current SOTA solutions; 2) it cuts the computational complexity of the self-attention modules to less than 1/100. Beyond ICD code assignment, we conduct a variety of other NLU tasks on a large commercial dataset named TrialTrove to showcase MDBERT's strength in delivering different levels of semantics.
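The abstract's central architectural idea, a bottom-up hierarchy in which each sentence is encoded independently and a document-level encoder then attends over the resulting sentence embeddings, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' exact MDBERT: the class name HierarchicalEncoder, the mean-pooling step, and all layer sizes are placeholders, and positional encodings are omitted for brevity.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    # Minimal sketch (assumed names/sizes): a sentence-level transformer
    # encodes each sentence independently, then a document-level transformer
    # attends across the pooled sentence embeddings.
    def __init__(self, vocab_size=30522, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers)
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, num_sentences, sentence_len)
        b, n, s = token_ids.shape
        x = self.embed(token_ids.view(b * n, s))            # encode all sentences in parallel
        sent_states = self.sent_encoder(x)                  # (b*n, s, d): attention only within a sentence
        sent_emb = sent_states.mean(dim=1).view(b, n, -1)   # pool each sentence to one vector (placeholder for [CLS])
        doc_states = self.doc_encoder(sent_emb)             # (b, n, d): attention only across sentences
        return sent_emb, doc_states.mean(dim=1)             # sentence-level and document-level embeddings

encoder = HierarchicalEncoder()
ids = torch.randint(0, 30522, (2, 8, 64))  # 2 documents, 8 sentences, 64 tokens each
sentence_vecs, document_vec = encoder(ids)

The complexity saving follows from this decomposition: full self-attention over a document of L tokens costs O(L^2), while the hierarchy costs roughly O(n*s^2 + n^2) for n sentences of s tokens each (L = n*s). As an illustrative calculation, L = 2048 split into 32 sentences of 64 tokens gives 32*64^2 + 32^2 ~ 1.3e5 versus 2048^2 ~ 4.2e6; the exact "less than 1/100" figure reported in the abstract will depend on the paper's particular segmentation and sequence lengths.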
