Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data

Paper title

Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data

Authors

Jiang, Lingjing; Haiminen, Niina; Carrieri, Anna-Paola; Huang, Shi; Vazquez-Baeza, Yoshiki; Parida, Laxmi; Kim, Ho-Cheol; Swafford, Austin D.; Knight, Rob; Natarajan, Loki

Abstract

Feature selection is indispensable in microbiome data analysis, but it is particularly challenging because microbiome data sets are high-dimensional, underdetermined, sparse and compositional. Considerable effort has recently gone into developing feature selection methods that handle these data characteristics, but almost all of these methods have been evaluated on the performance of model predictions. Little attention has been paid to a fundamental question: how appropriate are those evaluation criteria? Most feature selection methods control the model fit, but the ability to identify meaningful subsets of features cannot be evaluated on prediction accuracy alone. If tiny changes to the training data lead to large changes in the chosen feature subset, then many of the biological features that an algorithm has found are likely to be data artifacts rather than real biological signal. This crucial need to identify relevant and reproducible features motivates reproducibility evaluation criteria such as Stability, which quantifies how robust a method is to perturbations in the data. In our paper, we compare the popular model prediction metric MSE with the proposed reproducibility criterion Stability in evaluating four widely used feature selection methods in both simulations and experimental microbiome applications. We conclude that Stability is preferable to MSE as a feature selection criterion because it better quantifies the reproducibility of the feature selection method.
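The contrast between a prediction metric and a reproducibility criterion can be illustrated with a minimal sketch. The abstract does not give the formula for Stability, so the code below substitutes mean pairwise Jaccard similarity between the feature subsets selected on perturbed copies of the data; this is an assumed stand-in, not necessarily the estimator used in the paper. The function names (`jaccard`, `selection_stability`, `mse`) and the toy taxon labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Mean pairwise Jaccard similarity across selected feature subsets.

    Assumption: each element of `subsets` is the set of features chosen
    by the same method on a different perturbed (e.g. bootstrap) sample.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two selected subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mse(y_true, y_pred):
    """Mean squared error, the prediction metric named in the abstract."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: the same three taxa are selected on every resample.
stable_runs = [{"taxon_1", "taxon_2", "taxon_3"}] * 3
# Toy example: the selection drifts from one resample to the next.
unstable_runs = [{"a", "b"}, {"b", "c"}, {"c", "d"}]

print(selection_stability(stable_runs))    # 1.0
print(selection_stability(unstable_runs))  # about 0.22
```

In this toy setting a method could reach a similar MSE on both sets of runs while selecting very different features each time; a stability score makes that difference visible, which is the point the abstract argues.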
