Paper Title

Pitfalls and Guidelines for Using Time-Based Git Data

Authors

Samuel W. Flint, Jigyasa Chauhan, Robert Dyer

Abstract

Many software engineering research papers rely on time-based data (e.g., commit timestamps; issue report creation, update, and close dates; release dates). However, like most real-world data, time-based data is often dirty. To date, no study has quantified how frequently the software engineering research community uses such data, investigated the sources of dirty time-based data, or quantified how often such data is dirty. Depending on the research task and the method used, including dirty data could affect the research results. This paper presents an extended survey of papers that use time-based data, published in the Mining Software Repositories (MSR) conference series. Of the 754 technical track and data papers published at MSR from 2004 to 2021, at least 290 (38%) used time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines and best practices for researchers who use time-based data from Git repositories.
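To make the idea of "dirty" Git timestamp data more concrete, the following minimal Python sketch flags commits whose timestamps look suspicious. The three checks (earlier than Git's first release, later than the current time, earlier than a parent commit) are illustrative assumptions rather than the criteria the paper uses, and the function name `flag_suspicious`, the constant `GIT_RELEASE`, and the dictionary layout of a commit are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: Git's initial release (April 2005) serves as a rough lower bound.
GIT_RELEASE = datetime(2005, 4, 7, tzinfo=timezone.utc)


def flag_suspicious(commits, now=None):
    """Return {commit_id: [reasons]} for commits with questionable timestamps.

    `commits` is a list of dicts with the hypothetical keys:
      'id'        - commit identifier (string)
      'timestamp' - commit time as Unix seconds (int)
      'parents'   - list of parent commit ids
    """
    now = now or datetime.now(timezone.utc)
    by_id = {c["id"]: c for c in commits}
    flags = {}
    for c in commits:
        reasons = []
        ts = datetime.fromtimestamp(c["timestamp"], tz=timezone.utc)
        if ts < GIT_RELEASE:
            reasons.append("before_git")
        if ts > now:
            reasons.append("in_future")
        # A commit that claims to be older than one of its parents is out of order.
        for parent_id in c["parents"]:
            parent = by_id.get(parent_id)
            if parent is not None and parent["timestamp"] > c["timestamp"]:
                reasons.append("older_than_parent")
                break
        if reasons:
            flags[c["id"]] = reasons
    return flags
```

A flagged commit is not necessarily wrong: history imported from an older version control system, for example, can legitimately predate Git. The flags are best treated as candidates for manual inspection or for a sensitivity analysis, not as grounds for automatic removal.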