Bioinformatics Seminar

Tue 11 Oct. 2011, 17:20

Title: Kozanitis et al. Compressing Genomic Sequence Fragments Using SlimGene
Speaker: Martin Kravec

Abstract: With the advent of next generation sequencing technologies, the
cost of sequencing whole genomes is poised to go below $1000 per human
individual in a few years. As more and more genomes are sequenced,
analysis methods are undergoing rapid development, making it tempting to
store sequencing data for long periods of time so that the data can be
re-analyzed with the latest techniques. The challenging open research
problems, huge influx of data, and rapidly improving analysis techniques
have created the need to store and transfer very large volumes of data.
Compression can be achieved at many levels, including trace level
(compressing image data), sequence level (compressing a genomic sequence),
and fragment level (compressing a set of short, redundant fragment reads,
along with quality values on the base calls). We focus on fragment-level
compression, which is the pressing need today. Our article makes two
contributions, implemented in a tool, SlimGene. First, we introduce a set
of domain-specific lossless compression schemes that achieve over 40x
compression of fragments, outperforming bzip2 by over 6x. Including
quality values, we show a 5x compression using less running time than
bzip2. Second, given the discrepancy between the compression factor
obtained with and without quality values, we initiate the study of using
"lossy" quality values. Specifically, we show that a lossy quality value
quantization results in 14x compression but has minimal impact on
downstream applications like SNP calling that use the quality values.
Discrepancies between SNP calls made from the lossy and lossless
versions of the data are limited to low-coverage areas where even the SNP
calls made from the lossless version are marginal.
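
To give a feel for what lossy quantization of quality values means, here is a
minimal sketch. It is not SlimGene's scheme: it assumes simple uniform binning
of Phred+33 quality characters, and the function name, bin width, and offset
are all hypothetical choices made for illustration.

```python
def quantize_qualities(qual: str, bin_width: int = 8, offset: int = 33) -> str:
    """Replace each quality character by the lowest score in its bin.

    Assumes Phred+33 encoding and uniform bins; both are illustrative
    assumptions, not details taken from the paper.
    """
    out = []
    for ch in qual:
        score = ord(ch) - offset                  # decode the Phred score
        binned = (score // bin_width) * bin_width  # round down to bin start
        out.append(chr(binned + offset))           # re-encode as a character
    return "".join(out)


original = "IIIIHGF#"
lossy = quantize_qualities(original)
# Fewer distinct symbols remain, so a general-purpose or domain-specific
# compressor has less entropy to encode.
print(original, "->", lossy)
print(len(set(original)), "distinct symbols ->", len(set(lossy)))
```

The string length is unchanged; only the alphabet of quality symbols shrinks,
which is where the extra compression would come from under this assumption.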