论文标题
具有相同群集查询的混合簇的精确恢复
Exact Recovery of Mangled Clusters with Same-Cluster Queries
论文作者
论文摘要
我们研究了半监督的主动聚类框架中的群集恢复问题。给定一组有限的输入点,甲骨文揭示了是否有两个点是否位于同一集群中,我们的目标是恢复所有精确使用尽可能少的查询。为此,我们放宽了Ashtiani等人的球形$ K $ -MEANS群集假设,以允许有余量的任意椭圆形簇。这消除了以下假设:聚类是基于中心的(即通过优化问题定义的),并且包括所有通过旋转,轴尺度和点缺失组合单独转换球形簇的情况。我们表明,即使在更一般的设置中,仍然可以使用许多仅与输入点数对数进行对数的查询精确恢复潜在聚类。更准确地说,我们设计了一种算法,在给定$ n $点以$ k $ clusters中,使用$ o(k^3 \ ln k \ ln n)$ oracle Queries和$ \ tilde {o}(kn + k^3)$时间将聚类恢复为零错误分类错误。 $ o(\ cdot)$表示法隐藏了对簇的维度的指数依赖性,我们表明这是必要的,从而表征了问题的查询复杂性。我们的算法简单,易于实现,还可以使用低伸展分离器,一类具有额外理论保证的椭球。大型合成数据集的实验证实我们可以精确有效地重建聚类。
We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we relax the spherical $k$-means cluster assumption of Ashtiani et al.\ to allow for arbitrary ellipsoidal clusters with margin. This removes the assumption that the clustering is center-based (i.e., defined through an optimization problem), and includes all those cases where spherical clusters are individually transformed by any combination of rotations, axis scalings, and point deletions. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points. More precisely, we design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error. The $O(\cdot)$ notation hides an exponential dependence on the dimensionality of the clusters, which we show to be necessary thus characterizing the query complexity of the problem. Our algorithm is simple, easy to implement, and can also learn the clusters using low-stretch separators, a class of ellipsoids with additional theoretical guarantees. Experiments on large synthetic datasets confirm that we can reconstruct clusterings exactly and efficiently.