分层相对LEMPEL-ZIV压缩

论文标题

分层相对LEMPEL-ZIV压缩

Hierarchical Relative Lempel-Ziv Compression

论文作者

Bille, Philip, Gørtz, Inge Li, Puglisi, Simon J., Tarnow, Simon R.

论文摘要

相对LEMPEL-ZIV（RLZ）解析是一种字典压缩方法，其中$ s $相对于第二个字符串$ r $（称为参考）通过将$ s $解析为$ r $中的一系列子字符串。 RLZ在压缩与参考字符串高度相似的字符串方面特别有效，例如来自同一物种的个体的一组基因组。随着目前廉价的DNA测序成本，此类数据集变得极为丰富，并且正在迅速增长。在本文中，我们没有为整个集合使用单个参考字符串，而是研究了不同参考字符串作为集合子集的使用，以改善压缩。特别是，我们在字符串上形成一个生根的树（或层次结构），然后使用rlz和parent作为参考来压缩每个字符串，仅在纯文本中存储树的根。要解压缩，我们以BFS顺序从根部开始穿越树，使孩子相对于父母进行解压缩。我们表明，这种方法导致细菌基因组数据集的压缩有双重改善，与标准的单个参考方法相比，对减压时间的影响可忽略不计。我们表明，可以通过计算完整的加权挖掘图的最佳树木构建给定字符串的有效层次结构，其权重为源和目标顶点的RLZ解析中的短语数量。我们进一步表明，使用局部性敏感的哈希（Hashing）得出的稀疏图不是计算完整的图形，可以显着降低计算良好层次结构的成本，而不会产生不利影响压缩性能。

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string $S$ is compressed relative to a second string $R$ (called the reference) by parsing $S$ into a sequence of substrings that occur in $R$. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such data sets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we form a rooted tree (or hierarchy) on the strings and then compressed each string using RLZ with parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome data sets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a completed weighted digraph of the strings, with weights as the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely effecting compression performance.

下载PDF全文

下载文献需遵守相关版权规定

论文标题