Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text

Paper title

Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text

Authors

Chaochao Zhou, Bo Yang

Abstract

Many analysis and prediction tasks require the extraction of structured data from unstructured text. However, no annotation scheme or training dataset has been available for training machine learning models to mine structured data from text without special templates and patterns. To address this gap, this paper presents an end-to-end machine learning pipeline, Text2Struct, comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulate the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was trained and evaluated on an annotated text dataset collected from abstracts of medical publications on thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. In a random sample, most predicted relations between numerals and entities matched the ground-truth annotations well. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. The pipeline is expected to improve further as the dataset is expanded and other machine learning models are investigated. A code demonstration can be found at: https://github.com/zcc861007/Text2Struct
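
For readers unfamiliar with the reported metric: the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical). The sketch below is illustrative only. The function name, the (token index, label) representation, and the sample labels are assumptions made for this example, and the abstract does not state how Text2Struct computes its score.

```python
def dice_coefficient(predicted, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|), with both inputs treated as sets."""
    a, b = set(predicted), set(truth)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets match perfectly.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical annotations for one sentence, as (token index, label) pairs.
truth = {(3, "metric"), (4, "metric"), (6, "unit")}
predicted = {(3, "metric"), (6, "unit"), (7, "unit")}

# Two of the three items on each side agree.
print(dice_coefficient(predicted, truth))
```

In this toy case two of three predicted items agree with two of three annotated items, so the score is 2·2 / (3 + 3), about 0.67.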
