Paper title
Pinpointing Anomaly Events in Logs from Stability Testing -- N-Grams vs. Deep-Learning
Paper authors
Paper abstract
Because stability-testing execution logs can be very long, software engineers need help locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness: an N-Gram model and a deep-learning model with LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models on long log sequences from Android stability testing at our case company and on short log sequences from the public HDFS (Hadoop Distributed File System) dataset, measuring next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on the stability-testing logs (N-Gram 0.848 vs LSTM 0.865), whereas on the HDFS logs the N-Gram model is slightly more accurate (N-Gram 0.904 vs LSTM 0.900). The N-Gram model is far more computationally efficient than the deep-learning model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness appears to be a useful aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in the use of deep learning for software system anomaly detection, we found limited benefits in doing so. However, future work should examine whether our finding holds with different LSTM hyper-parameters, with other datasets, and with other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
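To make the N-Gram idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each log line has already been mapped to an event identifier, counts which event follows each context of n-1 events in normal sequences, and scores an event as 1 minus its observed conditional probability. The function names (`train_ngram`, `score_events`, `next_event_accuracy`), the `<s>` padding token, and the toy event names are hypothetical, and no smoothing or back-off is applied.

```python
from collections import Counter, defaultdict

PAD = "<s>"  # hypothetical start-of-sequence padding token


def _contexts(seq, n):
    """Yield (context, event) pairs, where context is the previous n-1 events."""
    padded = [PAD] * (n - 1) + list(seq)
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - n + 1:i]), padded[i]


def train_ngram(sequences, n=3):
    """Count which event follows each context in normal log sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for context, event in _contexts(seq, n):
            counts[context][event] += 1
    return counts


def score_events(counts, seq, n=3):
    """Return one anomaly score per event: 1 - P(event | context)."""
    scores = []
    for context, event in _contexts(seq, n):
        seen = counts.get(context)
        total = sum(seen.values()) if seen else 0
        prob = seen[event] / total if total else 0.0  # unseen context scores 1.0
        scores.append(1.0 - prob)
    return scores


def next_event_accuracy(counts, sequences, n=3):
    """Fraction of events that match the most frequent continuation."""
    hits = total = 0
    for seq in sequences:
        for context, event in _contexts(seq, n):
            seen = counts.get(context)
            if seen and seen.most_common(1)[0][0] == event:
                hits += 1
            total += 1
    return hits / total if total else 0.0


# Toy data (invented for illustration only).
normal = [["open", "read", "close"], ["open", "read", "read", "close"]]
model = train_ngram(normal, n=2)
scores = score_events(model, ["open", "close"], n=2)
accuracy = next_event_accuracy(model, [["open", "read", "close"]], n=2)
```

In this toy run, `close` directly after `open` never occurs in the normal sequences, so it receives the highest score, which is the kind of per-event signal the abstract describes as an aid for root cause analysis. A practical version would likely need smoothing or back-off for unseen contexts.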