可扩展的分层聚集聚类

论文标题

可扩展的分层聚集聚类

Scalable Hierarchical Agglomerative Clustering

论文作者

Monath, Nicholas, Dubey, Avinava, Guruganesh, Guru, Zaheer, Manzil, Ahmed, Amr, McCallum, Andrew, Mergen, Gokhan, Najork, Marc, Terzihan, Mert, Tjanaka, Bryon, Wang, Yuan, Wu, Yuchen

论文摘要

聚集聚类的适用性，用于推断层次结构和平面聚类，受其可扩展性的限制。现有的可扩展分层聚类方法牺牲了质量以速度，并经常导致集群过度合并。在本文中，我们提出了一种用于分层聚类的可扩展的，团聚的方法，该方法不会牺牲质量和数十亿个数据点。我们进行了详细的理论分析，表明在轻度的可分离性条件下，我们的算法不仅可以恢复最佳的平面分区，而且还可以对非参数DP均值目标提供两种易于抗性。这引入了分层聚类作为非参数聚类目标的近似算法的新应用。我们还将算法与经典的分层聚集聚类方法联系起来。我们在层次结构和扁平聚类设置中进行了广泛的经验实验，并表明我们提出的方法在公开可用的聚类基准方面取得了最新的结果。最后，我们通过将其应用于300亿查询的数据集来证明我们的方法的可伸缩性。人类对发现簇的评估表明，我们的方法比当前的最新方法找到了簇的质量。

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.

下载PDF全文

下载文献需遵守相关版权规定

论文标题