Paper Title

Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series

Authors

Vincent Jacob, Fei Song, Arnaud Stiegler, Bijan Rad, Yanlei Diao, Nesime Tatbul

Abstract

Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining a lot of attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.
