Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

Paper Title

Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

Authors

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham

Abstract

Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
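
The abstract names two building blocks from compositional data analysis: the Aitchison geometry of the simplex and subcompositions. As a rough illustration of what those operations look like, the sketch below combines two compositions along an Aitchison-geometry line (a closed, weighted geometric mean) and forms a random subcomposition by dropping parts and renormalizing. The function names, the `keep_prob` parameter, and the toy inputs are assumptions made for this example; it is not taken from the authors' code and may differ from the augmentations defined in the paper, for which the linked repository is the reference.

```python
import numpy as np


def closure(x):
    """Rescale a non-negative vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def aitchison_interpolate(x1, x2, lam):
    """Combine two compositions in Aitchison geometry.

    Computes (lam * x1) (+) ((1 - lam) * x2) using Aitchison powering and
    perturbation, i.e. a closed weighted geometric mean of the parts.
    Assumes strictly positive parts; zeros must be replaced beforehand.
    """
    log_mix = lam * np.log(x1) + (1.0 - lam) * np.log(x2)
    return closure(np.exp(log_mix))


def random_subcomposition(x, keep_prob, rng):
    """Zero out a random subset of parts and renormalize the rest.

    The ratios between the kept parts are unchanged, which is the defining
    property of a subcomposition. `keep_prob` is a hypothetical parameter.
    """
    mask = rng.random(x.shape[-1]) < keep_prob
    if not mask.any():
        # Always keep at least one part so the closure is well defined.
        mask[rng.integers(x.shape[-1])] = True
    return closure(x * mask), mask


# Toy example with three parts (illustrative values only).
rng = np.random.default_rng(0)
x1 = closure([10.0, 30.0, 60.0])
x2 = closure([40.0, 40.0, 20.0])
mixed = aitchison_interpolate(x1, x2, lam=0.5)
sub, mask = random_subcomposition(x1, keep_prob=0.7, rng=rng)
```

Both outputs remain on the simplex, so they can be passed to any model that expects compositional input. The interpolation requires strictly positive parts, so sparse microbiome counts would first need some zero-replacement step, which is outside the scope of this sketch.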
