近乎最佳的核心，用于稳健的聚类

论文标题

近乎最佳的核心，用于稳健的聚类

Near-optimal Coresets for Robust Clustering

论文作者

Huang, Lingxiao, Jiang, Shaofeng H. -C., Lou, Jianing, Wu, Xuan

论文摘要

我们考虑在$ \ mathbb {r}^d $中，特别是$ k $ - 群集问题（例如，$ k $ -Median和$ k $ -means和$ m $ utliers的成本，给定中心的成本$ c \ c \ subset \ subset \ mathbb {r}^d $ cogethess yterty the Prote coptions，经典集群显着改善了对$（M + K）$的指数依赖性[Feldman and Schulman，Soda'12]，或者具有较弱的双标准保证[Huang et al。，focs'18]我们的腔体适应了最新的框架[Braverman等人，focs'22]，该框架设计用于容量受限的集群，克服了一个新的挑战，即参与成本中的参与术语，尤其是排除的$ M $ M $ Outllyllillielllielllielllielplient，包括我们的基本量$ c $，以及我们的大量销量。均匀的采样和灵敏度采样。

We consider robust clustering problems in $\mathbb{R}^d$, specifically $k$-clustering problems (e.g., $k$-Median and $k$-Means with $m$ outliers, where the cost for a given center set $C \subset \mathbb{R}^d$ aggregates the distances from $C$ to all but the furthest $m$ data points, instead of all points as in classical clustering. We focus on the $ε$-coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within $ε$-relative error for all center sets. Our main result is an $ε$-coreset of size $O(m + \mathrm{poly}(k ε^{-1}))$ that can be constructed in near-linear time. This significantly improves previous results, which either suffers an exponential dependence on $(m + k)$ [Feldman and Schulman, SODA'12], or has a weaker bi-criteria guarantee [Huang et al., FOCS'18]. Furthermore, we show this dependence in $m$ is nearly-optimal, and the fact that it is isolated from other factors may be crucial for dealing with large number of outliers. We construct our coresets by adapting to the outlier setting a recent framework [Braverman et al., FOCS'22] which was designed for capacity-constrained clustering, overcoming a new challenge that the participating terms in the cost, particularly the excluded $m$ outlier points, are dependent on the center set $C$. We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.

下载PDF全文

下载文献需遵守相关版权规定

论文标题