Paper Title
Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems
Paper Authors
Paper Abstract
High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous, extraneous values). These kinds of issues are latent in nature and thus often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented. Moreover, the article outlines tool support for detecting data smells and presents the result of an initial smell detection on more than 240 real-world datasets.