论文标题
使用离散的草图数据和覆盖范围的共形频率估计
Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries
论文作者
论文摘要
本文基于具有较低内存足迹的草图,开发了在非常大的离散数据集中在非常大的离散数据集中构建置信区间的共形推理方法。这种方法不需要对数据分布的了解,并且可以与任何草图算法结合使用,包括但不限于著名的计数米草图,计数 - 凯奇及其变化。在解释了如何实现可交换随机查询的边际覆盖范围之后,我们扩展了解决方案,以提供更强大的推论,可以解释数据的离散性和异质性查询频率,从而提高了可能的分布转移的稳健性。这些结果通过一种新型的保形校准技术促进了这些结果,该技术可确保对大量不同的随机查询的有效覆盖范围。最后,我们显示我们的方法在模拟中以及文本和SARS-COV-2 DNA数据的示例中与现有的常见主义者和贝叶斯替代方案相比,经验性能提高了。
This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and variations thereof. After explaining how to achieve marginal coverage for exchangeable random queries, we extend our solution to provide stronger inferences that can account for the discreteness of the data and for heterogeneous query frequencies, increasing also robustness to possible distribution shifts. These results are facilitated by a novel conformal calibration technique that guarantees valid coverage for a large fraction of distinct random queries. Finally, we show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations as well as in examples of text and SARS-CoV-2 DNA data.