Bioinformatický seminár

Tue 12 Oct. 2010, 17:20

Title: Brandon et al. Data structures and compression algorithms for genomic sequence data
Speaker: Lenka Trojáková, Ivana Seleceniová

MOTIVATION: The continuing exponential accumulation of full genome data,
including full diploid human genomes, creates new challenges not only for
understanding genomic structure, function and evolution, but also for the
storage, navigation and privacy of genomic data. Here, we develop data
structures and algorithms for the efficient storage of genomic and other
sequence data that may also facilitate querying and protecting the data.
RESULTS: The general idea is to encode only the differences between a
genome sequence and a reference sequence, using absolute or relative
coordinates for the location of the differences. These locations and the
corresponding differential variants can be encoded into binary strings
using various entropy coding methods, from fixed codes such as Golomb and
Elias codes, to variables codes, such as Huffman codes. We demonstrate the
approach and various tradeoffs using highly variables human mitochondrial
genome sequences as a testbed. With only a partial level of optimization,
3615 genome sequences occupying 56 MB in GenBank are compressed down to
only 167 KB, achieving a 345-fold compression rate, using the revised
Cambridge Reference Sequence as the reference sequence. Using the
consensus sequence as the reference sequence, the data can be stored using
only 133 KB, corresponding to a 433-fold level of compression, roughly a
23% improvement. Extensions to nuclear genomes and high-throughput
sequencing data are discussed. AVAILABILITY: Data are publicly available
from GenBank, the HapMap web site, and the MITOMAP database. Supplementary
materials with additional results, statistics, and software
implementations are available from