Paper title
Semi-self-supervised Automated ICD Coding
Authors
Abstract
Clinical Text Notes (CTNs) contain physicians' reasoning processes, written in an unstructured free-text format, as they examine and interview patients. In recent years, several studies have provided evidence for the utility of machine learning in predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time-consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of unannotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might answer during a patient consultation. The features are then used to train a classifier for the diagnosis of certain types of diseases. We evaluate this data augmentation method over three tiers of data availability to the physician. The method shows a significant positive effect, which diminishes when clinical features from the examination of the patient and from diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that make decisions based on clinical features that do not include examinations or tests.
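The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the neural feature extractor is replaced here by a simple token-count scorer, the diagnosis classifier by a co-occurrence counter, and all function names, notes, feature names, and the pairing of notes with ICD-10 codes are hypothetical examples. It also assumes that the unannotated notes carry a diagnosis code but no clinical-feature annotations.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_extractor(annotated):
    """Stage 1: learn per-feature token scores from a small annotated set.

    Stand-in for the neural network: a token scores +1 each time it appears
    in a note where the feature is present, and -1 where it is absent.
    """
    scores = defaultdict(Counter)
    for text, features in annotated:
        for name, present in features.items():
            for tok in set(tokenize(text)):
                scores[name][tok] += 1 if present else -1
    return scores


def extract_features(scores, text):
    """Stage 2: impute clinical features for a note without annotations."""
    toks = set(tokenize(text))
    return {name: sum(s[t] for t in toks) > 0 for name, s in scores.items()}


def train_classifier(examples):
    """Stage 3: count how often each present feature co-occurs with a code."""
    counts = defaultdict(Counter)
    for feats, code in examples:
        for name, present in feats.items():
            if present:
                counts[code][name] += 1
    return counts


def predict(counts, feats):
    """Return the code whose training notes best match the present features."""
    active = [name for name, present in feats.items() if present]
    return max(sorted(counts), key=lambda c: sum(counts[c][n] for n in active))


# Hypothetical toy data: a small annotated set with clinical features ...
annotated = [
    ("patient has cough and sore throat", {"cough": True, "ear_pain": False}),
    ("patient has ear pain since monday", {"cough": False, "ear_pain": True}),
]
# ... and a set with diagnosis codes but no feature annotations (assumption).
unannotated = [
    ("dry cough for three days", "J06.9"),
    ("ear pain and discharge", "H66.9"),
]

scores = train_extractor(annotated)
imputed = [(extract_features(scores, text), code) for text, code in unannotated]
classifier = train_classifier(imputed)

print(predict(classifier, extract_features(scores, "persistent cough")))  # J06.9
```

The point of the sketch is the shape of the pipeline: the classifier never sees hand-annotated features for the larger set, only the features imputed by the stage 1 model.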