Paper title
CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning
Paper authors
Abstract
Modern computing platforms are highly configurable, with thousands of interacting configuration options. However, configuring these systems is challenging, and erroneous configurations can cause unexpected non-functional faults. This paper proposes CADET (short for Causal Debugging Toolkit), which enables users to identify, explain, and fix the root cause of non-functional faults early and in a principled fashion. CADET builds a causal model by observing the performance of the system under different configurations. Then, it uses causal path extraction followed by counterfactual reasoning over the causal model to: (a) identify the root causes of non-functional faults, (b) estimate the effects of various configurable parameters on the performance objective(s), and (c) prescribe candidate repairs to the relevant configuration options to fix the non-functional fault. We evaluated CADET on 5 highly configurable systems deployed on 3 NVIDIA Jetson systems-on-chip. We compare CADET with state-of-the-art configuration optimization and ML-based debugging approaches. The experimental results indicate that CADET can find effective repairs for faults in multiple non-functional properties with up to 17% more accuracy, up to 28% higher gain, and up to $40\times$ speed-up compared with other ML-based performance debugging methods. Compared to multi-objective optimization approaches, CADET can find fixes up to $9\times$ faster with comparable or better performance gain. Our case study of non-functional faults reported in NVIDIA's forum shows that CADET can find 14% better repairs than the experts' advice in less than 30 minutes.
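To make the counterfactual step concrete, the following is a minimal sketch of abduction, action, and prediction on a toy structural causal model. It is not CADET's implementation: the option names (`cpu_freq`, `swap_enabled`), the intermediate event (`cache_misses`), the coefficients, and the helper functions are all hypothetical, and the structural equations are fixed by hand rather than learned from observed configurations as the abstract describes.

```python
# Toy structural causal model. All names and coefficients are hypothetical
# and chosen only for illustration:
#   cache_misses = 50 - 20 * cpu_freq + 30 * swap_enabled + u_miss
#   latency      = 10 + 0.5 * cache_misses + u_lat


def cache_misses_eq(cfg, u_miss):
    """Structural equation for the intermediate system event."""
    return 50.0 - 20.0 * cfg["cpu_freq"] + 30.0 * cfg["swap_enabled"] + u_miss


def latency_eq(misses, u_lat):
    """Structural equation for the performance objective."""
    return 10.0 + 0.5 * misses + u_lat


def counterfactual_latency(observed, intervention):
    """Latency this same run would have had under a changed configuration."""
    cfg = observed["config"]
    # Abduction: recover the noise terms that explain the faulty observation.
    u_miss = observed["cache_misses"] - cache_misses_eq(cfg, 0.0)
    u_lat = observed["latency"] - latency_eq(observed["cache_misses"], 0.0)
    # Action: override only the options named in the intervention.
    new_cfg = {**cfg, **intervention}
    # Prediction: push the change through the causal path, keeping the noise.
    return latency_eq(cache_misses_eq(new_cfg, u_miss), u_lat)


def rank_repairs(observed, candidates):
    """Sort candidate repairs by counterfactual latency, lowest first."""
    scored = [(counterfactual_latency(observed, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


# A single faulty observation (hypothetical measurements).
faulty = {
    "config": {"cpu_freq": 1.0, "swap_enabled": 1},
    "cache_misses": 65.0,
    "latency": 45.0,
}
candidates = [
    {"cpu_freq": 2.0},
    {"swap_enabled": 0},
    {"cpu_freq": 2.0, "swap_enabled": 0},
]
ranked = rank_repairs(faulty, candidates)
for latency, repair in ranked:
    print(f"{repair} -> counterfactual latency {latency}")
```

In this toy model, changing both options gives the lowest counterfactual latency (20.0, down from the observed 45.0). An empty intervention reproduces the observed latency exactly, which is a quick consistency check on the abduction step.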