Paper title
Extracting Software Requirements from Unstructured Documents
Paper authors
Paper abstract
Identifying or extracting requirements from textual documents is a tedious and error-prone task that many researchers suggest automating. We manually annotated the PURE dataset and thus created a new dataset containing both requirements and non-requirements. Using this dataset, we fine-tuned the BERT model and compared the results with several baselines such as fastText and ELMo. To evaluate the model on semantically more complex documents, we compared the PURE dataset results with experiments on Request For Information (RFI) documents. RFIs often include software requirements, but in a less standardized way. The fine-tuned BERT showed promising results on the PURE dataset for the binary sentence classification task. Compared with previous and recent studies dealing with constrained inputs, our approach demonstrates high performance in terms of precision and recall, while being agnostic to unstructured textual input.