Paper Title
SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
Paper Authors
Paper Abstract
This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike well-known traditional algorithms, which assume that all the data is memory-resident, the proposed method is scalable and processes the input data chunk by chunk within the confines of a limited memory buffer. A temporary clustering model is built in the first phase and is then gradually updated by analyzing consecutive memory loads of points. At the end of the scalable clustering phase, an approximate structure of the original clusters is obtained. Finally, in another scan of the entire dataset and using a suitable criterion, each object is assigned an outlierness score, called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has low, linear time complexity and is more effective and efficient than the best-known conventional density-based methods, which must load all data into memory, as well as some fast distance-based methods that can operate on disk-resident data.
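The two-pass pipeline the abstract describes (a chunk-wise pass that builds an approximate cluster model within a bounded memory budget, followed by a second scan that assigns each point an outlierness score) can be sketched as follows. This is a deliberately simplified illustration, not the paper's actual density-based algorithm: the centroid-merging rule, the nearest-centroid scoring, and all parameter names (`radius`, the dense-cluster threshold) are hypothetical stand-ins for SDCOR's real machinery.

```python
import numpy as np

def scalable_cluster(chunks, radius=1.0):
    """First pass (sketch): maintain approximate cluster summaries
    (centroid, count), updating them one chunk at a time so only a
    single chunk plus the summaries ever live in memory. Each point
    merges into the nearest summary within `radius`, otherwise it
    opens a new summary."""
    centroids, counts = [], []
    for chunk in chunks:
        for x in chunk:
            if centroids:
                d = np.linalg.norm(np.array(centroids) - x, axis=1)
                j = int(np.argmin(d))
                if d[j] <= radius:
                    # Incremental mean update for the matched summary.
                    counts[j] += 1
                    centroids[j] = centroids[j] + (x - centroids[j]) / counts[j]
                    continue
            centroids.append(x.astype(float).copy())
            counts.append(1)
    return np.array(centroids), np.array(counts)

def outlierness_scores(chunks, centroids):
    """Second pass (sketch): rescan the data chunk by chunk and score
    each point by its distance to the nearest retained centroid;
    larger scores indicate stronger outlierness."""
    scores = []
    for chunk in chunks:
        d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
        scores.append(d.min(axis=1))
    return np.concatenate(scores)
```

A usage pattern under these assumptions: cluster all chunks, discard summaries that attracted too few points (a crude proxy for low density), then rescan and score. A lone point far from both dense clusters then receives the highest score.

```python
rng = np.random.default_rng(0)
a = rng.normal([0.0, 0.0], 0.1, size=(50, 2))   # dense cluster near (0, 0)
b = rng.normal([5.0, 5.0], 0.1, size=(50, 2))   # dense cluster near (5, 5)
chunks = [a, b, np.array([[2.5, 2.5]])]          # last point is an outlier
cents, counts = scalable_cluster(chunks, radius=1.0)
dense = cents[counts >= 5]                        # keep only dense summaries
scores = outlierness_scores(chunks, dense)
```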