Paper Title
Extracting Procedural Knowledge from Technical Documents
Paper Authors
Paper Abstract
Procedures are an important knowledge component of documents that cognitive assistants can leverage for automation, question answering, or driving a conversation. Parsing large, dense documents such as product manuals and user guides to automatically determine which parts describe procedures, and then extracting those parts, is a challenging problem. Most existing research has focused on extracting flows from given procedures or on understanding procedures in order to answer conceptual questions. Automatically identifying and extracting multiple procedures from documents of diverse formats remains a relatively underexplored problem. In this work, we cover some of this ground by 1) providing insights into how the structural and linguistic properties of documents can be grouped to define types of procedures, 2) analyzing documents to extract the relevant linguistic and structural properties, and 3) formulating procedure identification as a classification problem that leverages the document features derived from this analysis. We first implemented and deployed unsupervised techniques, which were used in different use cases. Evaluation across those use cases revealed the weaknesses of the unsupervised approach. We then designed an improved, supervised version. We demonstrate that our technique is effective in identifying procedures in large and complex documents, achieving an accuracy of 89%.
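The classification framing in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the verb list, the step-marker pattern, the equal feature weights, and the 0.5 threshold are all assumptions chosen for illustration. It derives one structural feature (list-style step markers) and one linguistic feature (imperative first word) from a block of text and scores the block heuristically. A supervised variant, as the abstract describes, would learn the weights from labeled examples rather than fix them by hand.

```python
import re

# Hypothetical verb list for illustration only; a real system would
# more likely rely on a part-of-speech tagger or parser.
IMPERATIVE_VERBS = {
    "click", "open", "select", "press", "enter",
    "install", "run", "type", "remove", "connect",
}

# Assumed structural cue: a leading number, bullet, or "Step N" marker.
STEP_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•]|step\s+\d+[:.]?)\s+", re.IGNORECASE)


def line_features(line):
    """Return (has_step_marker, starts_with_imperative) for one line."""
    marker = STEP_MARKER.match(line)
    body = line[marker.end():] if marker else line.strip()
    words = body.split()
    first = words[0].lower().strip(".,:;") if words else ""
    return bool(marker), first in IMPERATIVE_VERBS


def block_features(lines):
    """Aggregate line-level features into ratios over the non-empty lines."""
    feats = [line_features(l) for l in lines if l.strip()]
    n = len(feats) or 1  # avoid division by zero on empty blocks
    return {
        "marker_ratio": sum(m for m, _ in feats) / n,
        "imperative_ratio": sum(v for _, v in feats) / n,
    }


def is_procedure(lines, threshold=0.5):
    """Label a block as a procedure when the averaged feature score is high."""
    f = block_features(lines)
    score = 0.5 * f["marker_ratio"] + 0.5 * f["imperative_ratio"]
    return score >= threshold


steps = ["1. Open the settings panel.", "2. Select Network.", "3. Click Apply."]
prose = ["The device ships with a power adapter.", "It supports several network modes."]
print(is_procedure(steps), is_procedure(prose))
```

Keeping feature extraction separate from the decision rule means the same feature dictionary could be passed to a trained classifier in place of the fixed threshold.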