论文标题
核心概论:更新的调查
Introduction to Core-sets: an Updated Survey
论文作者
论文摘要
在优化或机器学习问题中,我们给出了一组项目,通常是在某些度量空间中的点,而目标是在候选解决方案的某些空间上最小化或最大化目标函数。例如,在聚类问题中,输入是某些度量空间中的一组点,一个共同的目标是计算其他一些空间(点,线)中的一组中心,以最大程度地减少这些点的距离之和。在数据库查询中,我们可能需要为特定查询集的$ k $中心计算一些此类。 但是,传统算法无法处理需要从传感器,例如GPS,音频或视频等传感器的无限分布式流进行实时计算的现代系统,这些系统或视频到达云的网络,或智能手机或机器人等较弱设备的网络。 Core-Set是输入“大数据”的“小数据”摘要,其中每个可能的查询在两个数据集上都具有相同的答案。通用技术实现了有效的核心\更改了流,分布式和动态数据的{维护}。然后可以将传统算法应用于这些核心,以维持近似的最佳解决方案。 面临的挑战是设计具有可证明其大小和近似误差之间可证明权衡的核心。这项调查以回顾性的方式总结了此类构造,旨在统一和简化最新的构造。
In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems, the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (points, lines) that will minimize the sum of distances to these points. In database queries, we may need to compute such a some for a specific query set of $k$ centers. However, traditional algorithms cannot handle modern systems that require parallel real-time computations of infinite distributed streams from sensors such as GPS, audio or video that arrive to a cloud, or networks of weaker devices such as smartphones or robots. Core-set is a "small data" summarization of the input "big data", where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset \changed{maintenance} of streaming, distributed and dynamic data. Traditional algorithms can then be applied on these coresets to maintain the approximated optimal solutions. The challenge is to design coresets with provable tradeoff between their size and approximation error. This survey summarizes such constructions in a retrospective way, that aims to unified and simplify the state-of-the-art.