Paper title
Multiple Testing in Genome-Wide Association Studies via Hierarchical Hidden Markov Models
Paper authors
Paper abstract
Large-scale multiple testing problems arise frequently in modern scientific research. Conventional multiple testing procedures often lose considerable testing efficiency because they do not account for correlations among tests. Appropriate use of correlation information not only improves the efficiency of multiple testing but also makes the results easier to interpret. Because disease- or trait-related single nucleotide polymorphisms (SNPs) tend to cluster and exhibit serial correlation, multiple testing procedures based on the hidden Markov model (HMM) have been successfully applied in genome-wide association studies (GWAS). However, modeling an entire chromosome with a single HMM is somewhat coarse. To overcome this issue, this paper employs the hierarchical hidden Markov model (HHMM) to describe local correlations among tests and develops a multiple testing procedure that not only automatically divides chromosome regions into different classes but also takes local correlations among tests into account. Theoretically, the proposed multiple testing procedure is shown to be valid and optimal in some sense. A data-driven procedure is then developed to mimic the oracle version. Extensive simulations and a real data analysis show that the new multiple testing procedure outperforms its competitors.
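For orientation, the single-HMM approach that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's HHMM procedure: it assumes a two-state chain (null and non-null), Gaussian emissions with known parameters (the oracle setting), and a step-up rule on the posterior null probabilities. The function names, the emission model, and the parameter values are all hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def null_posteriors(z_scores, init, trans, alt_mean, alt_sd=1.0):
    """Return P(state_i = null | all z-scores) for a two-state HMM.

    Assumed model (illustrative only): state 0 is null with N(0, 1)
    emissions, state 1 is non-null with N(alt_mean, alt_sd^2) emissions.
    Uses the scaled forward-backward recursions; very extreme z-scores
    could still underflow in this simple version.
    """
    n = len(z_scores)
    emit = [(normal_pdf(z, 0.0, 1.0), normal_pdf(z, alt_mean, alt_sd))
            for z in z_scores]

    # Forward pass, renormalised at every position.
    first = [init[s] * emit[0][s] for s in (0, 1)]
    total = sum(first)
    alpha = [[p / total for p in first]]
    for i in range(1, n):
        cur = [emit[i][s] * sum(alpha[-1][r] * trans[r][s] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        alpha.append([p / total for p in cur])

    # Backward pass, also renormalised (the scale cancels in the posterior).
    beta = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * beta[i + 1][r]
                   for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        beta[i] = [p / total for p in cur]

    posteriors = []
    for a, b in zip(alpha, beta):
        w_null, w_alt = a[0] * b[0], a[1] * b[1]
        posteriors.append(w_null / (w_null + w_alt))
    return posteriors


def step_up_rejections(null_post, level):
    """Reject the k hypotheses with the smallest posterior null probability,
    where k is the largest count whose running mean stays at or below level."""
    order = sorted(range(len(null_post)), key=lambda i: null_post[i])
    running, k = 0.0, 0
    for j, idx in enumerate(order, start=1):
        running += null_post[idx]
        if running / j <= level:
            k = j
    return sorted(order[:k])


# Hypothetical inputs: six z-scores with a cluster of large values.
z = [0.1, -0.3, 3.5, 4.0, 3.2, 0.2]
post = null_posteriors(z, init=(0.9, 0.1),
                       trans=((0.95, 0.05), (0.2, 0.8)), alt_mean=3.0)
rejected = step_up_rejections(post, level=0.1)
```

Because the posterior at each position depends on the whole sequence, a large z-score inside a cluster receives more support than the same value in isolation. The sketch uses one transition matrix for the entire sequence, which is the simplification the abstract describes as somewhat coarse.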