Paper Title
Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
Paper Authors
Paper Abstract
Code summarization, the task of generating useful comments for a given piece of code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge this gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such data-preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in the performance of code summarization. We believe these findings and insights will enable a better understanding of data quality in code summarization tasks and pave the way for related research and practice.
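The abstract does not detail how the cleaning tool detects noise, but a minimal sketch of the kind of heuristic filter such a tool might apply to code-comment pairs is shown below. The noise categories used here (empty or trivially short comments, empty function bodies, IDE-generated boilerplate, and commented-out code in the comment field) are illustrative assumptions, not the paper's actual rule set, and the function names are hypothetical.

```python
import re

# Hypothetical markers of IDE-generated boilerplate; the paper's
# actual rule set may differ.
AUTO_GENERATED_MARKERS = (
    "auto-generated",
    "todo auto-generated method stub",
    "created by",
)

def looks_auto_generated(comment: str) -> bool:
    """Flag comments matching common IDE boilerplate markers."""
    lowered = comment.lower()
    return any(marker in lowered for marker in AUTO_GENERATED_MARKERS)

def looks_like_code(comment: str) -> bool:
    """Flag 'comments' that appear to be commented-out code."""
    return bool(re.search(r"[;{}]\s*$|=\s*new\s|\breturn\b", comment))

def is_noisy_pair(code: str, comment: str, min_words: int = 3) -> bool:
    """Return True if a code-comment pair should be filtered out.

    Checks a few illustrative noise categories: empty or trivially
    short comments, empty function bodies, IDE boilerplate, and
    comments that are really commented-out code.
    """
    if len(comment.split()) < min_words:  # empty or trivially short comment
        return True
    if not code.strip():                  # empty function body
        return True
    if looks_auto_generated(comment):     # IDE boilerplate
        return True
    if looks_like_code(comment):          # commented-out code
        return True
    return False

# Usage: keep only clean pairs before training a summarization model.
pairs = [
    ("def add(a, b):\n    return a + b",
     "Adds two numbers together and returns the sum."),
    ("def stub():\n    pass",
     "TODO Auto-generated method stub"),
]
clean = [(c, s) for c, s in pairs if not is_noisy_pair(c, s)]
```

In practice, a rule-based filter like this runs once over the raw dataset before train/validation/test splitting, so that noisy pairs cannot inflate or deflate reported model performance.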