Generative Capacity of Probabilistic Protein Sequence Models

Paper Title

Generative Capacity of Probabilistic Protein Sequence Models

Authors

Francisco McGee, Quentin Novinger, Ronald M. Levy, Vincenzo Carnevale, Allan Haldane

Abstract

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict the effect of mutations. Despite encouraging results, quantitative characterization and comparison of GPSM-generated probability distributions is still lacking. It is currently unclear whether GPSMs can faithfully reproduce the complex multi-residue mutation patterns observed in natural sequences arising due to epistasis. We develop a set of sequence statistics to assess the "generative capacity" of three GPSMs of recent interest: the pairwise Potts Hamiltonian, the VAE, and the site-independent model, using natural and synthetic datasets. We show that the generative capacity of the Potts Hamiltonian model is the largest, in that the higher order mutational statistics generated by the model agree with those observed for natural sequences. In contrast, we show that the VAE's generative capacity lies between the pairwise Potts and site-independent models. Importantly, our work measures GPSM generative capacity in terms of higher-order sequence covariation statistics which we have developed, and provides a new framework for evaluating and interpreting GPSM accuracy that emphasizes the role of epistasis.
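To make the contrast between a site-independent model and covariation statistics concrete, the following is a minimal sketch, not the authors' implementation or their higher-order statistics. It assumes the simplest pairwise covariation measure, C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b), and a toy alignment invented for illustration. All function names (`site_frequencies`, `pair_covariation`, `sample_site_independent`) are hypothetical.

```python
from collections import Counter
from itertools import product
import random


def site_frequencies(msa):
    """Per-column residue frequencies f_i(a) of an alignment."""
    n = len(msa)
    length = len(msa[0])
    return [
        {a: c / n for a, c in Counter(seq[i] for seq in msa).items()}
        for i in range(length)
    ]


def pair_covariation(msa, i, j):
    """Pairwise covariation C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b)."""
    n = len(msa)
    f = site_frequencies(msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)
    return {
        (a, b): f_ij[(a, b)] / n - f[i][a] * f[j][b]
        for a, b in product(f[i], f[j])
    }


def sample_site_independent(msa, n_samples, rng):
    """Sample sequences with every column drawn independently from f_i."""
    f = site_frequencies(msa)
    return [
        "".join(rng.choices(list(fi), weights=list(fi.values()))[0] for fi in f)
        for _ in range(n_samples)
    ]


# Toy alignment (an assumption for illustration only): columns 0 and 1
# are perfectly coupled, A always pairs with K and D always pairs with E.
toy_msa = ["AKG", "AKG", "DEG", "DEG"]
rng = random.Random(0)
synthetic = sample_site_independent(toy_msa, 20000, rng)

natural_c = pair_covariation(toy_msa, 0, 1)[("A", "K")]
synthetic_c = pair_covariation(synthetic, 0, 1)[("A", "K")]
print(f"toy alignment C_01(A,K):     {natural_c:.3f}")
print(f"site-independent C_01(A,K):  {synthetic_c:.3f}")
```

In this sketch the site-independent samples match the single-site frequencies of the toy alignment but lose the coupling between columns 0 and 1, so their covariation is near zero up to sampling noise. The abstract's comparison concerns the same kind of gap, measured with higher-order statistics that this sketch does not attempt to reproduce.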
