Paper Title
A Survey on Sampling and Profiling over Big Data (Technical Report)
Paper Authors
Paper Abstract
Driven by advances in internet technology and computer science, data volumes are growing exponentially. Big data brings both new opportunities and new challenges. On the one hand, we can analyze and mine big data to discover hidden information and extract more value. On the other hand, the 5V characteristics of big data, especially Volume (the sheer amount of data), pose challenges for storage and processing. Many traditional data mining algorithms, machine learning algorithms, and data profiling tasks struggle to handle data at this scale, because processing it demands substantial hardware resources and is time consuming. Sampling methods can effectively reduce the amount of data and help speed up data processing. Hence, sampling has been widely studied and used in the big data context, e.g., methods for determining sample size and for combining sampling with big data processing frameworks. Data profiling is the activity of discovering metadata about a data set, and it has many use cases, e.g., profiling relational data, graph data, and time series data for anomaly detection and data repair. However, data profiling is computationally expensive, especially for large data sets. Therefore, this paper surveys sampling and profiling in the big data context and investigates how sampling is applied in different categories of data profiling tasks. According to the experimental results reported in these studies, results obtained from sampled data are close to, or even exceed, those obtained from the full data. Sampling therefore plays an important role in the era of big data, and we have reason to believe that it will become an indispensable step in big data processing in the future.