论文标题

用于评估和维修的数据不一致的复杂性和有效算法

Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

论文作者

Miao, Dongjing, Cai, Zhipeng, Li, Jianzhong, Gao, Xiangyu, Liu, Xianmin

论文摘要

数据不一致评估和维修是数据质量管理的主要问题。作为基本的计算任务,最佳子集修复不仅用于数据库维修进度期间的成本估算,而且还直接用于得出数据库不一致的评估。计算最佳子集修复是从不一致的数据库中找到最小元组,该数据库的删除导致左侧的子集导致左侧。对复杂性和有效算法的紧密结合仍然未知。在本文中,我们提高了现有的复杂性和算法结果,并快速估计了最佳子集修复的大小。我们首先加强了最佳子集修复计算问题的二分法,我们表明它不仅是ApxComplete,而且还可以近似于最佳子集修复,而大多数情况下,其因子高于17/16美元。我们第二次显示$(2-0.5^{\tinyσ-1})$ - 近似近似值,每当给定$σ$函数依赖性时,$(2-η_k+\ frac {η_k} {k} {k} {k} {k})$ - 近似 - 大约$ k> 1美元的属性。我们最终显示了一个sublinear估计器,该估计器对子集查询的最佳\ textit {s}修复的大小,它输出的估计值$ 2N+εn$具有很高的可能性,从而得出了fd续签程度的估计值的比率$ 2+ε$。为了支持各种用于FD固有性评估的子集查询,我们将它们统一为$ \ subseteq $ -Oracle,可以回答会员资格查询,并在给出数字$ p $的情况下返回$ p $ p $ pug均匀采样。实验是在范围查询上作为$ \ subseteq $ -Oracle的实现进行的,结果显示了我们的FD结合度估计器的效率。

Data inconsistency evaluating and repairing are major concerns in data quality management. As the basic computing task, optimal subset repair is not only applied for cost estimation during the progress of database repairing, but also directly used to derive the evaluation of database inconsistency. Computing an optimal subset repair is to find a minimum tuple set from an inconsistent database whose remove results in a consistent subset left. Tight bound on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, together with a fast estimation on the size of optimal subset repair. We first strengthen the dichotomy for optimal subset repair computation problem, we show that it is not only APXcomplete, but also NPhard to approximate an optimal subset repair with a factor better than $17/16$ for most cases. We second show a $(2-0.5^{\tinyσ-1})$-approximation whenever given $σ$ functional dependencies, and a $(2-η_k+\frac{η_k}{k})$-approximation when an $η_k$-portion of tuples have the $k$-quasi-Tur$\acute{\text{a}}$n property for some $k>1$. We finally show a sublinear estimator on the size of optimal \textit{S}-repair for subset queries, it outputs an estimation of a ratio $2n+εn$ with a high probability, thus deriving an estimation of FD-inconsistency degree of a ratio $2+ε$. To support a variety of subset queries for FD-inconsistency evaluation, we unify them as the $\subseteq$-oracle which can answer membership-query, and return $p$ tuples uniformly sampled whenever given a number $p$. Experiments are conducted on range queries as an implementation of $\subseteq$-oracle, and results show the efficiency of our FD-inconsistency degree estimator.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源