Paper Title
GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis
Authors
Abstract
Read mapping is a fundamental yet computationally expensive step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose various approaches such as filters that select the reads that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the computation overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Through rigorous analysis of read mapping processes, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD. Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05$\times$ (1.52-3.32$\times$) for read sets with high similarity to the reference genome and 1.45-33.63$\times$ (2.70-19.2$\times$) for read sets with low similarity to the reference genome.